V-FOLD CROSS-VALIDATION AND V-FOLD PENALIZATION IN 
LEAST-SQUARES DENSITY ESTIMATION 



SYLVAIN ARLOT AND MATTHIEU LERASLE 



Abstract. This paper studies V-fold cross-validation for model selection in least-squares density
estimation. The goal is to provide theoretical grounds for choosing V in order to minimize the
least-squares risk of the selected estimator. We first prove a non-asymptotic oracle inequality for
V-fold cross-validation and its bias-corrected version (V-fold penalization), with an upper bound
decreasing as a function of V. In particular, this result implies that V-fold penalization is
asymptotically optimal. Then, we compute the variance of V-fold cross-validation and related
criteria, as well as the variance of key quantities for model selection performance. We show that
these variances depend on V like $1 + 1/(V - 1)$ (at least in some particular cases), suggesting
that the performance improves considerably from V = 2 to V = 5 or 10, and then is almost constant.
Overall, this explains the common advice to take V = 10 (at least in our setting and when the
computational power is limited), as confirmed by some simulation experiments.



1. Introduction 

Cross-validation methods are widely used in statistics, for estimating the risk of a given statistical
estimator [Sto74, All74, Gei75] and for selecting among a family of estimators. For instance,
cross-validation can be used for model selection, where a collection of linear spaces is given (the
models) and the problem is to choose the best least-squares estimator over one of these models. We
refer to [AC10] for more references about cross-validation for model selection.

Then, a natural question arises: which cross-validation method should be used for minimizing
the risk of the selected estimator? For instance, a popular family of cross-validation methods is
V-fold cross-validation [Gei75, often called k-fold cross-validation], which depends on an integer
parameter V, and enjoys a smaller computational cost than other classical cross-validation methods.
The question becomes (1) which V is optimal, and (2) can we do almost as well as the optimal V
with a small computational cost, that is, a small V? Answering the second question is particularly
useful for practical applications where the computational power is limited.

Surprisingly, few theoretical results exist for answering these two questions, especially with a
non-asymptotic point of view [AC10]. In short, previous results in least-squares regression show
that at first order, V-fold cross-validation is suboptimal for model selection if V stays bounded,
because V-fold cross-validation is biased [Arl08]. When correcting for the bias [Bur89, Arl08], we
recover asymptotic optimality whatever V, but without any theoretical result distinguishing among
values of V in the non-asymptotic second-order terms in the risk bounds [Arl08, oracle inequality].

Intuitively, if there is no bias, increasing V should reduce the variance of the V-fold
cross-validation estimator of the risk, hence a smaller risk for the final estimator, as confirmed by
some simulation experiments [Arl08, for instance]. But variance computations for unbiased V-fold
methods have only been made in a very specific regression setting and they are asymptotic [Bur89].

This paper aims at providing theoretical grounds for the choice of V by two means: a
non-asymptotic oracle inequality with a second-order term depending on V (Section 3) and exact
variance computations shedding light on the influence of V on the variance (Section 4). In
particular, we would like to understand why the common advice in the literature is to take V = 5 or
10, based on simulation experiments [HTF09, for instance].

Key words and phrases. V-fold cross-validation, density estimation, model selection, penalization.

The results of the paper are proved in the least-squares density estimation framework, because
we can then benefit from explicit closed-form formulas and simplifications for the V-fold criteria.
In particular, we show V-fold cross-validation and all leave-p-out methods are particular cases of
V-fold penalties in least-squares density estimation (Lemma 1).

The first main result of the paper (Theorem 1) is an oracle inequality with leading constant
$1 + \varepsilon_n(V)$, with $\varepsilon_n(V) \to 0$ when the sample size $n$ goes to infinity (for
unbiased V-fold methods) and $\varepsilon_n(V)$ decreasing as a function of V. To the best of our
knowledge, Theorem 1 is the first non-asymptotic oracle inequality for V-fold methods enjoying such
properties: the leading constant $1 + o(1)$ is new in density estimation, and the fact that
$\varepsilon_n$ decreases with V has never been obtained whatever the framework. Theorem 1 relies on
a new concentration inequality for the V-fold penalty (Proposition 25) with deviation terms that
decrease when V increases and are sharp, in some cases at least.

The second main result of the paper (Theorem 2) provides the first non-asymptotic variance
computations for V-fold criteria that make it possible to understand precisely how the model
selection performance of V-fold cross-validation or penalization depends on V. Previous results only
focused on the variance of the V-fold criterion [Bur89, Cel08, Cel12, CR08], which is not sufficient
for our purpose, as explained in Section 4. In our setting, we can then explain theoretically why
taking V > 10 is not necessary for getting a performance close to the optimum, as confirmed by
experiments on synthetic data in Section 5.

2. Least-squares density estimation and definition of V-fold procedures 

This first section introduces the framework of the paper, the main procedures studied, and some 
useful notation. 

2.1. General statistical framework. We observe a sample $\xi_{1:n} = (\xi_1, \dots, \xi_n) \in \mathcal{X}^n$
of $n$ independent random variables with common distribution $P$. We assume $P$ has a density $s$
with respect to some measure $\mu$ on $\mathcal{X}$ and $s \in L^2(\mu)$. The goal is to estimate $s$
from $\xi_{1:n}$, that is, to build an estimator $\hat s = \hat s(\xi_{1:n}) \in L^2(\mu)$ such that
its quadratic risk $\|\hat s - s\|^2$ is as small as possible, where for any $t \in L^2(\mu)$,
$\|t\|$ denotes its $L^2(\mu)$-norm: $\|t\|^2 := \int_{\mathcal{X}} t^2 \, d\mu$.

Projection estimators are among the most classical estimators in this framework. Given a linear
subspace $S_m$ of $L^2(\mu)$ (called a model), the projection estimator of $s$ onto $S_m$ is defined by

(1)    $\hat s_m := \operatorname{argmin}_{t \in S_m} \left\{ \|t\|^2 - 2 P_n(t) \right\}$ ,

where $P_n = n^{-1} \sum_{i=1}^n \delta_{\xi_i}$ is the empirical measure and for any function
$t \in L^2(\mu)$, $P_n(t) = \int t \, dP_n = n^{-1} \sum_{i=1}^n t(\xi_i)$. The quantity minimized in
the definition of $\hat s_m$ is often called the empirical risk, and can be denoted by

    $P_n \gamma(t) = \|t\|^2 - 2 P_n(t)$   where   $\forall x \in \mathcal{X}$, $\forall t \in L^2(\mu)$,  $\gamma(t; x) = \|t\|^2 - 2 t(x)$ .

The function $\gamma$ is called the least-squares contrast.
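
As an illustration of definition (1), here is a minimal Python sketch for the special case of regular histograms on $[0, 1]$ with the Lebesgue measure (this choice of model, and the function names `histogram_estimator` and `empirical_risk`, are assumptions made only for the example). With the orthonormal basis $\psi_k = \sqrt{d}\, 1_{[k/d, (k+1)/d)}$, the minimizer in (1) is $\sum_k P_n(\psi_k) \psi_k$, that is, $d$ times the bin frequency on each bin.

```python
import random

def histogram_estimator(sample, d):
    """Heights of the projection estimator on d regular bins of [0, 1].

    On bin k the estimator equals d * (fraction of the sample in bin k).
    """
    counts = [0] * d
    for x in sample:
        counts[min(int(x * d), d - 1)] += 1  # x == 1.0 goes to the last bin
    n = len(sample)
    return [d * c / n for c in counts]

def empirical_risk(heights, sample):
    """P_n gamma(t) = ||t||^2 - 2 P_n(t) for a histogram t given by its bin heights."""
    d = len(heights)
    sq_norm = sum(h * h for h in heights) / d  # integral of t^2 over [0, 1]
    mean_t = sum(heights[min(int(x * d), d - 1)] for x in sample) / len(sample)
    return sq_norm - 2.0 * mean_t

random.seed(0)
sample = [random.random() ** 2 for _ in range(200)]  # arbitrary data in [0, 1]
heights = histogram_estimator(sample, 4)
risk = empirical_risk(heights, sample)
```

In this sketch the empirical risk of the estimator equals $-\|\hat s_m\|^2$, and any other histogram on the same bins has a larger empirical risk, which is what definition (1) requires.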

2.2. Model selection. When a collection of models $(S_m)_{m \in \mathcal{M}_n}$ is given, the model
selection problem [Mas07] consists in choosing from data one among the corresponding projection
estimators $(\hat s_m)_{m \in \mathcal{M}_n}$. The goal is to design a model selection procedure
$\hat m : \mathcal{X}^n \to \mathcal{M}_n$ so that the final estimator $\tilde s := \hat s_{\hat m}$
has a quadratic risk as small as possible, that is, comparable to the risk of the oracle
$\inf_{m \in \mathcal{M}_n} \|\hat s_m - s\|^2$. More precisely, we aim at proving an oracle
inequality of the form

    $\|\tilde s - s\|^2 \le C_n \inf_{m \in \mathcal{M}_n} \left\{ \|\hat s_m - s\|^2 \right\} + R_n$

with large probability. As long as the remainder term $R_n$ is negligible compared with the risk of
the oracle, the main goal is to minimize the leading constant $C_n$, which should be close to 1 for
the procedure $\hat m$ to be optimal.

In this paper, we focus on model selection procedures of the form

    $\hat m := \operatorname{argmin}_{m \in \mathcal{M}_n} \{ \operatorname{crit}(m) \}$ ,

where $\operatorname{crit} : \mathcal{M}_n \to \mathbb{R}$ is some data-driven criterion. Since our
goal is to satisfy an oracle inequality,

    $\operatorname{crit}_{\mathrm{id}}(m) = \|\hat s_m - s\|^2 - \|s\|^2 = -2 P(\hat s_m) + \|\hat s_m\|^2 = P\gamma(\hat s_m)$

is an ideal criterion.

A popular way of designing a model selection criterion is penalization [BBM99, BM97, BM01, Mas07]:

    $\operatorname{crit}(m) = P_n\gamma(\hat s_m) + \operatorname{pen}(m)$ ,

for some penalty function $\operatorname{pen} : \mathcal{M}_n \to \mathbb{R}$, possibly data-driven.
From the ideal criterion $\operatorname{crit}_{\mathrm{id}}$, we get the ideal penalty

    $\operatorname{pen}_{\mathrm{id}}(m) := \operatorname{crit}_{\mathrm{id}}(m) - P_n\gamma(\hat s_m) = (P - P_n)\gamma(\hat s_m)$
    $= 2 (P_n - P)(\hat s_m) = 2 (P_n - P)(\hat s_m - s_m) + 2 (P_n - P)(s_m)$ ,

where

    $s_m := \operatorname{argmin}_{t \in S_m} \{ P\gamma(t) \} = \operatorname{argmin}_{t \in S_m} \{ \|t - s\|^2 \}$

is the orthogonal projection of $s$ onto $S_m \subset L^2(\mu)$.

2.3. V-fold cross-validation. A standard approach for model selection is cross-validation. We
refer the reader to [AC10] for references and a complete survey on cross-validation for model
selection. This section only provides the minimal definitions and notation necessary for the
remainder of the paper.

For any subset $A \subset \{1, \dots, n\}$, let

    $P_n^{(A)} := \frac{1}{\operatorname{Card}(A)} \sum_{i \in A} \delta_{\xi_i}$ ,   $\hat s_m^{(A)} := \operatorname{argmin}_{t \in S_m} \{ \|t\|^2 - 2 P_n^{(A)}(t) \}$ ,

$P_n^{(-A)} := P_n^{(A^c)}$ and $\hat s_m^{(-A)} := \hat s_m^{(A^c)}$, where $A^c = \{1, \dots, n\} \setminus A$
denotes the complement of $A$.

The main idea of cross-validation is data splitting: in order to estimate
$\operatorname{crit}_{\mathrm{id}}(m) = P\gamma(\hat s_m)$, some $T \subset \{1, \dots, n\}$ is chosen,
one first trains $\hat s_m(\cdot)$ with $(\xi_i)_{i \in T}$, then tests the trained estimator on the
remaining data $(\xi_i)_{i \in T^c}$. This provides the hold-out criterion

(2)    $\operatorname{crit}_{\mathrm{HO}}(m, T) := P_n^{(-T)} \gamma(\hat s_m^{(T)}) = -2 P_n^{(-T)}(\hat s_m^{(T)}) + \|\hat s_m^{(T)}\|^2$ ,

and all cross-validation criteria are defined as averages of hold-out criteria with various subsets T.

This paper focuses on V-fold cross-validation. Let $V \le n$ be a positive integer and let
$\mathcal{B} = (B_1, \dots, B_V)$ be some partition of $\{1, \dots, n\}$. The V-fold cross-validation
criterion is defined by

    $\operatorname{crit}_{\mathrm{VFCV}}(m, \mathcal{B}) := \frac{1}{V} \sum_{K=1}^{V} \operatorname{crit}_{\mathrm{HO}}(m, B_K^c)$ .

Compared to the hold-out, one expects cross-validation to be less variable thanks to the averaging
over V splits of the sample into $(\xi_i)_{i \in B_K}$ and $(\xi_i)_{i \in B_K^c}$.

Since $\operatorname{crit}_{\mathrm{VFCV}}(m, \mathcal{B})$ is known to be a biased estimator of
$\mathbb{E}[\operatorname{crit}_{\mathrm{id}}(m)]$, Burman [Bur89] proposed the bias-corrected V-fold
cross-validation criterion

    $\operatorname{crit}_{\mathrm{corr,VFCV}}(m, \mathcal{B}) := \operatorname{crit}_{\mathrm{VFCV}}(m, \mathcal{B}) + P_n\gamma(\hat s_m) - \frac{1}{V} \sum_{K=1}^{V} P_n\gamma(\hat s_m^{(-B_K)})$ .

2.4. Resampling-based and V-fold penalties. Another approach for building general data-driven
model selection criteria is penalization with a resampling-based estimator of the ideal penalty, as
proposed by Efron [Efr83] with the bootstrap and recently generalized to all resampling schemes
[Arl09]. Let $W = (W_1, \dots, W_n)$ be some random vector of $\mathbb{R}^n$, independent from
$\xi_{1:n}$, with $n^{-1} \sum_{i=1}^n W_i = 1$, and denote by $P_n^W = n^{-1} \sum_{i=1}^n W_i \delta_{\xi_i}$
the weighted empirical distribution of the sample. Then, the resampling-based penalty associated
with W is defined as

(3)    $\operatorname{pen}_W(m) := C_W \, \mathbb{E}_W \left[ (P_n - P_n^W) \gamma(\hat s_m^W) \right]$ ,

where $\hat s_m^W \in \operatorname{argmin}_{t \in S_m} \{ P_n^W \gamma(t) \}$, $\mathbb{E}_W[\cdot]$
denotes the expectation with respect to W only (that is, conditionally on the sample $\xi_{1:n}$), and
$C_W$ is some positive constant. Resampling-based penalties have been studied recently in the
least-squares density estimation framework [Ler12b], assuming W is exchangeable (i.e., its
distribution is invariant under any permutation of its coordinates).

Since computing $\operatorname{pen}_W(m)$ exactly has a large computational cost in general for
exchangeable W, some non-exchangeable resampling schemes were introduced in [Arl08], inspired by
V-fold cross-validation: given some partition $\mathcal{B} = (B_1, \dots, B_V)$ of $\{1, \dots, n\}$,
the weight vector W is defined by $W_i = (1 - \operatorname{Card}(B_J)/n)^{-1} 1_{i \notin B_J}$ for
some random variable J with uniform distribution over $\{1, \dots, V\}$. Then, $P_n^W = P_n^{(-B_J)}$,
so that the associated resampling penalty, called V-fold penalty, is defined by

       $\operatorname{pen}_{\mathrm{VF}}(m, \mathcal{B}, C) := \frac{C}{V} \sum_{K=1}^{V} \left[ (P_n - P_n^{(-B_K)}) \gamma(\hat s_m^{(-B_K)}) \right]$
(4)    $= \frac{2C}{V} \sum_{K=1}^{V} (P_n^{(-B_K)} - P_n)(\hat s_m^{(-B_K)})$ ,

where $C > 0$ is left free for flexibility, which is quite useful according to Lemma 1.

2.5. Links between V-fold penalties, resampling penalties and (corrected) V-fold
cross-validation. In this paper, we focus our study on V-fold penalties because formula (4) covers
all V-fold and resampling-based procedures mentioned in Sections 2.3 and 2.4.

First, when V = n, the only possible partition is $\mathcal{B}_{\mathrm{LOO}} = (\{1\}, \dots, \{n\})$
and the V-fold penalty is called the leave-one-out penalty
$\operatorname{pen}_{\mathrm{LOO}}(m, C) := \operatorname{pen}_{\mathrm{VF}}(m, \mathcal{B}_{\mathrm{LOO}}, C)$.
The associated weight vector W is exchangeable, hence Eq. (4) leads to all exchangeable resampling
penalties since they are all equal up to a deterministic multiplicative factor in the least-squares
density estimation framework, as proved in [Ler12b].

For V-fold methods, let us assume $\mathcal{B}$ is regular, that is,

(H5*)    $\mathcal{B}$ is a partition of $\{1, \dots, n\}$ and $\forall K \in \{1, \dots, V\}$, $\operatorname{Card}(B_K) = \frac{n}{V}$ .

Then, we get the following connection between V-fold penalization and cross-validation methods.



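Lemma 1. In least-squares density estimation, under assumption (H5*),

(5)    $\operatorname{crit}_{\mathrm{corr,VFCV}}(m, \mathcal{B}) = P_n\gamma(\hat s_m) + \operatorname{pen}_{\mathrm{VF}}(m, \mathcal{B}, V - 1)$

(6)    $\operatorname{crit}_{\mathrm{VFCV}}(m, \mathcal{B}) = P_n\gamma(\hat s_m) + \operatorname{pen}_{\mathrm{VF}}\left(m, \mathcal{B}, V - \frac{1}{2}\right)$

(7)    $\operatorname{crit}_{\mathrm{LPO}}(m, p) = P_n\gamma(\hat s_m) + \operatorname{pen}_{\mathrm{LPO}}\left(m, p, \frac{n}{p} - \frac{1}{2}\right)$

(8)    $= P_n\gamma(\hat s_m) + \operatorname{pen}_{\mathrm{LOO}}\left(m, (n-1) \frac{n/p - 1/2}{n/p - 1}\right)$

       $= P_n\gamma(\hat s_m) + \operatorname{pen}_{\mathrm{VF}}\left(m, \mathcal{B}_{\mathrm{LOO}}, (n-1) \frac{n/p - 1/2}{n/p - 1}\right)$

where for any $p \in \{1, \dots, n-1\}$, the leave-p-out cross-validation criterion is defined by

    $\operatorname{crit}_{\mathrm{LPO}}(m, p) := \binom{n}{p}^{-1} \sum_{A \in \mathcal{E}_p} P_n^{(A)} \gamma(\hat s_m^{(-A)})$   with   $\mathcal{E}_p := \{ A \subset \{1, \dots, n\} \text{ s.t. } \operatorname{Card}(A) = p \}$

and the leave-p-out penalty is defined by

    $\forall C > 0$,   $\operatorname{pen}_{\mathrm{LPO}}(m, p, C) := C \binom{n}{p}^{-1} \sum_{A \in \mathcal{E}_p} (P_n - P_n^{(-A)}) \gamma(\hat s_m^{(-A)})$ .

Lemma 1 is proved in Section F.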

Remark 1. Eq. (5) was first proved in [Arl08] in a general framework that includes least-squares
density estimation, assuming only (H5*). Eq. (6) shows that V-fold cross-validation and V-fold
penalization (with $C = V - 1/2$) yield the same criterion. Similarly, Eq. (7) shows that leave-p-out
cross-validation and leave-p-out penalization (with $C = n/p - 1/2$) yield the same criterion.
Eq. (8) follows from Lemma 6.11 in [Ler12b] since $\operatorname{pen}_{\mathrm{LPO}}$ belongs to the
family of exchangeable resampling penalties, with weights $W_i := (1 - p/n)^{-1} 1_{i \notin A}$ where
$A$ is randomly chosen uniformly over $\mathcal{E}_p$. It can also be deduced from Proposition 3.1 in
[Cel12], see Section F.
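
The identities (5) and (6) can be checked numerically. The sketch below does so for regular histograms on $[0, 1]$ with the Lebesgue measure and a regular partition into blocks $B_K = \{ i : i \equiv K \pmod V \}$; the histogram model, the block construction, the Beta-distributed data and all function names are assumptions made for this illustration only. It relies on the fact that, for a histogram with bin frequencies $a_k$ evaluated under an empirical measure with bin frequencies $q_k$, the contrast equals $d \sum_k (a_k^2 - 2 a_k q_k)$.

```python
import random

def bin_freqs(points, d):
    """Fractions of `points` falling in each of d regular bins of [0, 1]."""
    counts = [0] * d
    for x in points:
        counts[min(int(x * d), d - 1)] += 1
    return [c / len(points) for c in counts]

def contrast(train, evalf, d):
    """Q gamma(s_hat): histogram trained with bin frequencies `train`,
    evaluated under the empirical measure with bin frequencies `evalf`."""
    return d * sum(a * a - 2.0 * a * q for a, q in zip(train, evalf))

def vfold_quantities(sample, d, V):
    """Return (empirical risk, V-fold CV, corrected V-fold CV, V-fold penalty with C = 1)."""
    blocks = [sample[k::V] for k in range(V)]  # regular blocks when V divides n
    full = bin_freqs(sample, d)
    emp_risk = contrast(full, full, d)
    cv = train_on_full = pen_unit = 0.0
    for k in range(V):
        test = bin_freqs(blocks[k], d)
        train_pts = [x for j in range(V) if j != k for x in blocks[j]]
        train = bin_freqs(train_pts, d)
        cv += contrast(train, test, d) / V             # hold-out criterion, averaged
        train_on_full += contrast(train, full, d) / V  # P_n gamma of the trained estimator
        pen_unit += (contrast(train, full, d) - contrast(train, train, d)) / V
    corrected = cv + emp_risk - train_on_full          # Burman's bias correction
    return emp_risk, cv, corrected, pen_unit

random.seed(1)
sample = [random.betavariate(2, 5) for _ in range(120)]
emp_risk, cv, corrected, pen_unit = vfold_quantities(sample, d=8, V=5)
gap_cv = cv - (emp_risk + (5 - 0.5) * pen_unit)         # Eq. (6) with C = V - 1/2
gap_corr = corrected - (emp_risk + (5 - 1) * pen_unit)  # Eq. (5) with C = V - 1
```

Both gaps vanish up to floating-point error, as Lemma 1 predicts when all blocks have the same size.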

As a conclusion of this section, in the least-squares density estimation framework and assuming
only (H5*), it is sufficient to study V-fold penalization with a free multiplicative factor $C > V - 1$
in front of the penalty for studying also V-fold cross-validation, corrected V-fold cross-validation
and exchangeable resampling penalties. Therefore, in Sections 3 to 5, we focus our study on V-fold
methods. Additional results on hold-out (penalization) are given in Section 7.1 for completing the
picture.



3. Oracle inequalities 

In this section, we state our first main result, that is, a non-asymptotic oracle inequality satisfied
by V-fold procedures. The main novelty of this result is that it holds for any $V \in \{1, \dots, n\}$,
any constant $C > (V - 1)/2$ in front of the penalty, and the leading constant of the oracle
inequality is as small as $1 + o(1)$ when C is well chosen. In addition, as shown in Section 2.5, it
implies oracle inequalities satisfied by leave-p-out procedures for all p.

Recall that, given a partition $\mathcal{B}$ of $\{1, \dots, n\}$ into V regular blocks, the V-fold
estimator is defined by $\tilde s(\mathcal{B}, C) := \hat s_{\hat m(\mathcal{B}, C)}$, where

(9)    $\hat m(\mathcal{B}, C) \in \operatorname{argmin}_{m \in \mathcal{M}_n} \left\{ P_n\gamma(\hat s_m) + \operatorname{pen}_{\mathrm{VF}}(m, \mathcal{B}, C) \right\}$ .
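
To make definition (9) concrete, here is a minimal Python sketch of V-fold penalization over a collection of regular histograms on $[0, 1]$. The collection (bin numbers 1 to 20), the blocks $B_K = \{ i : i \equiv K \pmod V \}$, the simulated data and the names `vfold_penalized_criterion` and `select_model` are hypothetical choices for this example; the penalty is computed directly from formula (4).

```python
import random

def freqs(points, d):
    """Fractions of `points` falling in each of d regular bins of [0, 1]."""
    counts = [0] * d
    for x in points:
        counts[min(int(x * d), d - 1)] += 1
    return [c / len(points) for c in counts]

def vfold_penalized_criterion(sample, d, V, C):
    """P_n gamma(s_hat_m) + pen_VF(m, B, C) for the regular histogram with d bins."""
    full = freqs(sample, d)
    emp_risk = -d * sum(p * p for p in full)  # P_n gamma(s_hat_m) = -||s_hat_m||^2
    pen = 0.0
    for k in range(V):
        # Training sample: all points outside block B_k = {i : i % V == k}.
        train = freqs([x for i, x in enumerate(sample) if i % V != k], d)
        # Formula (4): (2C / V) * (P_n^{(-B_k)} - P_n) applied to the trained histogram.
        pen += 2.0 * C / V * d * sum((a - p) * a for a, p in zip(train, full))
    return emp_risk + pen

def select_model(sample, dims, V, C):
    """Return the minimizer of the penalized criterion and all criterion values."""
    crit = {d: vfold_penalized_criterion(sample, d, V, C) for d in dims}
    return min(crit, key=crit.get), crit

random.seed(2)
sample = [random.betavariate(2, 2) for _ in range(200)]
V = 5
d_hat, crit = select_model(sample, range(1, 21), V, C=V - 1)
```

Taking `C = V - 1` gives the bias-corrected V-fold criterion of Lemma 1, while `C = V - 0.5` would give plain V-fold cross-validation.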
3.1. Main Assumptions. In order to state the main results, we assume the existence of some
constants L+, cm, «Mi c r, Cr, r > such that 
• For all m S M n , 

2 

< UP 



(HI) 



sup 



sup 




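• The models are nested, i.e.,

(H2)    $\forall m, m' \in \mathcal{M}_n$,  $S_m \cup S_{m'} \in \{ S_m, S_{m'} \}$ .

• The number of models is polynomial, i.e.,

(H3)    $\forall n \in \mathbb{N}^*$,  $\operatorname{Card}(\mathcal{M}_n) \le c_{\mathcal{M}} \, n^{\alpha_{\mathcal{M}}}$ .

• The oracle risk $R_n^* := n \inf_{m \in \mathcal{M}_n} \mathbb{E}\left[ \|\hat s_m - s\|^2 \right]$ satisfies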
(H4) Vn e N*, c^(lnn) 4+r < < c£n(lnn)" 

• Pseudo-regularity of the partition $\mathcal{B} = (B_K)_{K = 1, \dots, V}$, i.e.,

(H5)    $\mathcal{B}$ is a partition of $\{1, \dots, n\}$ and $\sup_{1 \le K \le V} \left| \operatorname{Card}(B_K) - \frac{n}{V} \right| \le 1$ .

The assumptions will be discussed in Section 3.3.



3.2. Oracle inequality for V-fold procedures. The first main result of the paper is the oracle 
inequality satisfied by V-fold estimators. 

Theorem 1. Let $\xi_{1:n}$ be an i.i.d. sample with marginal density $s \in L^2(\mu)$. Let
$(S_m)_{m \in \mathcal{M}_n}$ be a collection of linear spaces satisfying assumptions (H1), (H2), (H3),
(H4) w.r.t. the density $s$. Let $\mathcal{B}$ satisfy assumption (H5), $C > (V - 1)/2$ and

    $\kappa := \frac{C}{V - 1}$ ,   $\delta := 2 (\kappa - 1)$ ,



e :- 



IV 



Inn \ Vlnn 



Let s be the estimator defined by Q. A constant L > 0, depending only on L+, cm> c^M; c r> Cr 



and r, exists such that, with probability at least 1 

(io) 1 " Lk£ ~ 6 

where Vit £ 



n 



s\\ 2 < inf 



^ || S Sm || ^ > 



1 + Lk£ + 5 + 
$u_+ := \max\{u, 0\}$ and $u_- := \max\{-u, 0\}$.

Theorem 1 is proved in Section G. Let us make a few comments.

• The rate of convergence of the leading constant, $\sqrt{\ln n} \, (R_n^*)^{-1/4}$, was the one
obtained (in an upper bound) for resampling penalties with exchangeable weights in [Ler12b].
Theorem 1 proves that V-fold penalties achieve at least the same rates as soon as V is at least of
order $(\ln n)^3$. This is interesting, because V-fold penalties are exchangeable resampling
penalties only when V = n.

• Compared to previous results on V-fold penalization [Arl08], an important feature of Theorem 1
is that all values $V \in \{1, \dots, n\}$ are allowed. Furthermore, the remainder term
$\varepsilon(n, V)$ decreases with V, which gives some theoretical confirmation that increasing the
number V of data splits may improve the performance of V-fold methods, as observed
empirically. See also Sections 4 and 7.1 on this point.

• Theorem 1 and Lemma 1 imply oracle inequalities satisfied by several estimators based
on resampling criteria, according to Section 2.5. In particular, V-fold cross-validation
corresponds to $C = V - 1/2$, corrected V-fold cross-validation to $C = V - 1$, and the
leave-p-out to V = n and $C = (n - 1)(n - p/2)/(n - p)$.

• The proof of Theorem 1 mainly relies on a new concentration inequality for V-fold penalties
(Proposition 25) that is of independent interest.

• We say that (10) is a non-asymptotic oracle inequality in the sense that all parameters can
vary with n as long as (H1), (H2), (H3), (H4) hold. However, the constants involved may
not be sufficiently small to have a result interesting for small samples.



3.3. Discussion of the assumptions. (H1), (H2) and (H3) hold in classical collections of models
considered in density estimation, such as regular dyadic histogram spaces, Fourier spaces and regular
wavelet spaces, see for example [Ler11]. We refer also to [DL93] for a more complete presentation
of these spaces and their approximation properties.

All our results can also be applied to regular histograms provided that $\|s\|_\infty < \infty$.
Moreover, (H2) can be replaced by one among (H2') and (H2°) below:



(H2') 



< oo, 



sup 

{m,m')&M^ l 



tes, 



sup 

,+S m /,||t||<l 



1*111 < Tn 



(H2°)    $(\psi_\lambda)_{\lambda \in \Lambda}$ is an orthonormal basis of $L^2(\mu)$ and $\exists \Lambda_m \subset \Lambda$, $S_m = \langle (\psi_\lambda)_{\lambda \in \Lambda_m} \rangle$ .

These two assumptions ensure that a general assumption (H2g), introduced in Lemma 12 in the proof
of Theorem 1, holds. (H2g) is sufficient to prove the main theorems.

(H4) is a technical assumption. The lower bound means essentially that we are in a nonparametric
situation. The upper bound roughly means that at least one of the estimators is consistent. Note
also that (H4) is weaker than the assumption on the bias made in [Arl08]. We refer to [Ler11] for
more details on the latter point.

Assumption (H5*) could be relaxed, at the price of enlarging the bound (10), depending on how
far $\mathcal{B}$ is from being regular. Throughout the paper, we choose to focus on (H5), or even on
(H5*), to keep the results and their proofs simple. Note also that regular partitions are the most
classical ones for V-fold methods.



3.4. Comparison with previous works. Few non-asymptotic oracle inequalities have been proved
for V-fold penalization or cross-validation procedures.

Concerning cross-validation, previous oracle inequalities are listed in the survey [AC10]. In the
least-squares density estimation framework, oracle inequalities were proved by [vdLDK04] in the
V-fold case, but they compared the risk of the selected estimator with the risk of an oracle trained
with $n(V - 1)/V$ data. Optimal oracle inequalities were proved by [Cel12] for leave-p-out
estimators with $p \ll n$. In comparison, Theorem 1 gives an oracle inequality for any V and
considers the strongest possible oracle, that is, trained with n data. Moreover, leave-p-out criteria
are covered with V = n and $C = (n - 1)(n/p - 1/2)/(n/p - 1)$. In that case, we have
$C = V - 1 + o(V - 1)$. Hence the leading constant in Theorem 1 is asymptotically equal to 1 and we
recover the result of [Cel12].

Concerning V-fold penalization, previous results were either valid for V = n only (in
least-squares density estimation [Ler12b] and for regressogram estimators [Arl09]), or for V bounded
when n tends to infinity (for regressogram estimators [Arl08]). In comparison, Theorem 1 provides
a result valid for all $V \in \{1, \dots, n\}$. In particular, the leading constant of the oracle
inequality of [Arl08] increases with V, whereas the leading constant in Eq. (10) decreases as V
increases, as observed in simulation experiments (for instance, in Section 5 and in [Arl08]).

4. Dependence on V of V-fold penalization and cross-validation



An interesting feature of our oracle inequality is that the remainder term $\varepsilon(n, V)$
improves with V. However, our concentration inequality provides the same deviation rate as in the
exchangeable case for $V = O((\ln n)^3)$. In this section we would like to investigate further in
this direction and understand more precisely the dependence on V of the V-fold procedures.

In order to do so, let us consider the first step of the proof of Theorem 1: by definition (9),
$\tilde s(\mathcal{B}, C)$ satisfies, for all $m \in \mathcal{M}_n$,

(11)    $\|\tilde s - s\|^2 \le \|\hat s_m - s\|^2 + \operatorname{pen}_{\mathrm{VF}}(m, \mathcal{B}, C) - \operatorname{pen}_{\mathrm{id}}(m) + \operatorname{pen}_{\mathrm{id}}(\hat m) - \operatorname{pen}_{\mathrm{VF}}(\hat m, \mathcal{B}, C)$ .
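
For the reader's convenience, inequality (11) can be recovered in three steps from the definitions of Section 2.2. The derivation below is a sketch that uses only the identities $P\gamma(t) = \|t - s\|^2 - \|s\|^2$ and $\operatorname{pen}_{\mathrm{id}}(m) = (P - P_n)\gamma(\hat s_m)$, with the shorthand $\hat m = \hat m(\mathcal{B}, C)$ and $\tilde s = \hat s_{\hat m}$.

```latex
\begin{align*}
% Step 1: \hat m minimizes the penalized criterion (9), so for every m
P_n\gamma(\hat s_{\hat m}) + \mathrm{pen}_{\mathrm{VF}}(\hat m, \mathcal{B}, C)
  &\le P_n\gamma(\hat s_m) + \mathrm{pen}_{\mathrm{VF}}(m, \mathcal{B}, C) . \\
% Step 2: rewrite the empirical risk through the ideal penalty, for any m
P_n\gamma(\hat s_m)
  &= P\gamma(\hat s_m) - \mathrm{pen}_{\mathrm{id}}(m)
   = \|\hat s_m - s\|^2 - \|s\|^2 - \mathrm{pen}_{\mathrm{id}}(m) . \\
% Step 3: substitute Step 2 on both sides of Step 1 and cancel -\|s\|^2
\|\hat s_{\hat m} - s\|^2
  &\le \|\hat s_m - s\|^2
   + \bigl( \mathrm{pen}_{\mathrm{VF}}(m, \mathcal{B}, C) - \mathrm{pen}_{\mathrm{id}}(m) \bigr)
   - \bigl( \mathrm{pen}_{\mathrm{VF}}(\hat m, \mathcal{B}, C) - \mathrm{pen}_{\mathrm{id}}(\hat m) \bigr) .
\end{align*}
```

The last line shows why only the increments of $\operatorname{pen}_{\mathrm{VF}} - \operatorname{pen}_{\mathrm{id}}$ between two models matter.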



Then, two quantities play a key role for deriving an oracle inequality from Eq. (11): the
expectation and the deviations of the normalized increments

    $\Delta(m, m', \mathcal{B}, C) := \sqrt{n} \left( \operatorname{pen}_{\mathrm{VF}}(m, \mathcal{B}, C) - \operatorname{pen}_{\mathrm{id}}(m) - \left( \operatorname{pen}_{\mathrm{VF}}(m', \mathcal{B}, C) - \operatorname{pen}_{\mathrm{id}}(m') \right) \right)$

for all $m, m' \in \mathcal{M}_n$, or at least for $m, m'$ that are "close to the oracle $m^*$ or
likely to be selected", since only $m, m' \in \{ m^*, \hat m \}$ truly matter at the end. The
influence of the expectations $\mathbb{E}[\Delta(m, m', \mathcal{B}, C)]$ is made clear by Theorem 1,
thanks to the term $\delta(C, V)$ which appears in the leading constant of the oracle inequality (10).
In this section, we further investigate the amplitude of the deviations of
$\Delta(m, m', \mathcal{B}, C)$ by computing their variance as a function of V. The quantity
$\Delta(m, m', \mathcal{B}, C)$ is related to relative bounds [Cat07, Section 1.4] which can be used as
a tool for model selection [Aud04].

We focus here on $C = V - 1/2$, that is, the V-fold cross-validation criterion, and on $C = V - 1$
which corresponds to its corrected version (see Lemma 1). Results valid for any $C > 0$ are given in
Proposition 15 in the appendix. For simplicity, we assume all blocks have the same size n/V, that
is, assumption (H5*) holds true; in particular, V divides n.

Theorem 2. Let $S_m$, $S_{m'}$ be two linear subspaces of $L^2(\mu)$, and let
$(\psi_\lambda)_{\lambda \in \Lambda_m}$ (resp. $(\psi_\lambda)_{\lambda \in \Lambda_{m'}}$) be some
orthonormal basis of $S_m$ (resp. $S_{m'}$) in $L^2(\mu)$. For every $q, r \in \{1, 2\}$,
$\lambda, \lambda' \in \Lambda_m \cup \Lambda_{m'}$ and $\Lambda, \Lambda' \in \{ \Lambda_m, \Lambda_{m'} \}$,
let us define

c{ q p ■.= e [( Mb) - PM ) q (Mb) - nMY] , 

WA :=var(^A(6)), /3(A,A'):= £ {Cx^f^ 

AeA.A'eA' 

B ( A m , k m i ) =/3(A m ,A m ) + /3(A m /,A m /)-2/3(A m ,A m /) . 
If B satisfies (|H5*j) . then, 

(12) var ( A(m, m', B, V - 1) ) = 4 var P (s m -s m ,)+ * V B ( A ™> A ™') . 

v V — 1 n 

Moreover, for every A, A' £ {A m , A m /}, let us define 

7 (A, A') := £ [P^x)C^ +PW X ,)C™' 
AeA.A'eA' 

C (A m , A m /) := 7(A m , A m ) + 7 (A m /, A m > ) - 27 (A m , A m /) 

C(A,A'):= E ^i^- 
AeA.A'eA' 

and D (A m ,A m /) := C (A m , A m ) + C (A TO ', A m / ) - 2C (A m , A r , 




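Let $\kappa = 1 + [2(V - 1)]^{-1}$ and $\nu = 1 + \kappa^2 (V - 1)^{-1} - [4n(V - 1)^2]^{-1}$. For
every $m, m' \in \mathcal{M}_n$, if

    $\Delta_{\mathrm{VFCV}}(m, m') := \Delta\left(m, m', \mathcal{B}, V - \frac{1}{2}\right) = \sqrt{n} \left[ \operatorname{crit}_{\mathrm{VFCV}}(m) - \operatorname{crit}_{\mathrm{id}}(m) - \left( \operatorname{crit}_{\mathrm{VFCV}}(m') - \operatorname{crit}_{\mathrm{id}}(m') \right) \right]$ ,

then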



(13) var (A V fcv(?™>?™')) = 4var P (s m - s m r) + (A m , A 



_ 2C (A m , A m / ) D (Am, A m / ) 
(7-l)n (7-l) 2 n 2 

Finally, for the {corrected) V -fold criterion itself, we have 



4 , , 107 



2 



var (critvFCv(^; &)) = - varp ( s m ) + 



n ""•*" ' (7-l) 2 n 2 



6 2 1 

' ~ 57 + 5T 2 ~ 5^ 



/3(Ar, 



27 7 2 
(14) -(^I)^7(A m ) + ^- w C(A m ) 



4 2 

(15) var ( crit CO rr,VFCv(w; B) ) = - varp ( s m ) + — 



n n z 



V + 3 1 

V - 1 ~ n 



/3(A m ) 



-47(A m ) + ^3C(A m ) • 

Theorem 2 is proved in Section H, where the variance of $\Delta(m, m', \mathcal{B}, C)$ is computed
for any $C > 0$, as well as the variances of $\operatorname{pen}_{\mathrm{VF}}(m, \mathcal{B}, C)$,
$\operatorname{pen}_{\mathrm{id}}(m)$, $\operatorname{pen}_{\mathrm{VF}}(m, \mathcal{B}, C) - \operatorname{pen}_{\mathrm{id}}(m)$
and $P_n\gamma(\hat s_m) + \operatorname{pen}_{\mathrm{VF}}(m, \mathcal{B}, C)$ (Proposition 15). By
Lemma 1, the variance of the increments of several resampling criteria (V-fold cross-validation and
leave-p-out) can then be deduced from Proposition 15.

The key quantities $B(\Lambda_m, \Lambda_{m'})$, $C(\Lambda_m, \Lambda_{m'})$ and
$D(\Lambda_m, \Lambda_{m'})$ appearing in Theorem 2 do not depend on the choice of particular bases
$(\psi_\lambda)_{\lambda \in \Lambda_m}$, $(\psi_\lambda)_{\lambda \in \Lambda_{m'}}$, see Section I.2
in the supplementary material.

Interpretation of Theorem 2 with regular histogram models. Assuming a particular structure for
the models $S_m$, $S_{m'}$, Eq. (12) and (13) can be simplified, which makes it possible to compare
them and makes their dependence on V clearer.

Let $S_m$ and $S_{m'}$ be the two regular histogram models of respective bin sizes $d_m^{-1}$ and
$d_{m'}^{-1}$. Formally, $S_m$ is defined as the vector space of functions constant on each interval
$I_{k,m} := [k/d_m, (k+1)/d_m)$, $k \in \mathbb{Z}$, and $S_{m'}$ is defined similarly. Then, if for
any $k \in \mathbb{Z}$, $\psi_{k,m}(x) = \sqrt{d_m} \, 1_{I_{k,m}}(x)$, the family
$(\psi_{k,m})_{k \in \mathbb{Z}}$ is an orthonormal basis of $S_m$.

By Proposition 21 in Section I.2, Eq. (13) becomes



(16) var ( A V fcv("i, m') ) 



— fa ( V) B ( A m , A m / ) + 4 ( 1 + S ( V, n) ) varp ( 
n 



where fa (7) := 1 + — — - + — 1 + ' ' 



7-1 (7-1) 2 4(F-1) 3 4n(7-l) 2 



and ^ (y ' n)= (7^ + (7- 1 l) 2 n 2 = ° (1) • 
So, the variance is slightly larger for V-fold cross-validation than for V-fold penalization, but not
much larger. If V stays bounded as n tends to infinity, only the first term is multiplied by a bounded
factor. If $V \to \infty$ as $n \to \infty$, both procedures yield the same variance asymptotically
(uniformly over $m, m'$).

Eq. (12) and (16) show that the variance terms of V-fold penalization and cross-validation depend
on V like



-f(V)B(K. 
n 



, A m / ) + 4 varp ( 



for some decreasing function $f$ that depends on the procedure considered. So, in both cases,
increasing V decreases the variance of the procedure. In order to understand by which factor the
variance decreases when V increases, we have to compare the terms $n^{-1} B(\Lambda_m, \Lambda_{m'})$
and $\operatorname{var}_P(s_m - s_{m'})$.

Let us now assume in addition that $S_{m'} \subset S_m$, that is, $d_{m'}$ divides $d_m$ since we
consider regular histogram models. This holds for instance with dyadic regular partitions.
Section 11.21 shows that B (A m , A m > ) is of the order of d m > (at least when d m 
Il s m'|l — L \\ s II > f° r some constant L). In addition, 

II ii 2 



Then, Remark H] in 
is large enough and 



< varp ( s r 



< 



< 



and if we assume S m and S m f are both "close to the oracle", the bias terms ||s — s r 



and the expected variances n d rt 



n 



d m / approximately match. 



Overall, these informal arguments suggest that when S m > C S„ 



(17) 



Lxf(V)^<8f(V) 



B ( A m , A„ 



+ 4 var p ( s Ti 



are both "close to the oracle", 
d,, 



s m ')<L 2 (f(V)+L 3 )^ 
n n n 

for some positive constants $L_1, L_2, L_3$. Since $1 < f(V) < 4$ whatever V for both cross-validation
and penalization, the maximal and minimal values of the variance (obtained with V = 2 and V = n
respectively) allowed by Eq. (17) only differ by a constant factor. More precisely, for
cross-validation $f(2) = 3.25 + o(1)$ and $f(10) < 1.124$, and for penalization $f(2) = 2$ and
$f(10) = 10/9 < 1.12$. So, increasing V from 2 to 10 already puts the variance very close to its
minimal value. Increasing V further (say, from 10 to 50) may not improve the performance much, at
least in terms of variance.
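
The numerical values quoted above can be reproduced with a few lines of Python. The sketch below assumes that the factor for penalization is $f(V) = 1 + 1/(V - 1)$, as stated in the abstract, and that the factor for cross-validation is, up to a term vanishing with n, $1 + \kappa^2/(V - 1)$ with $\kappa = 1 + [2(V - 1)]^{-1}$ as in Theorem 2; the function names `f_pen` and `f_cv` are hypothetical.

```python
def f_pen(V):
    """Variance factor for V-fold penalization with C = V - 1: 1 + 1/(V - 1)."""
    return 1.0 + 1.0 / (V - 1)

def f_cv(V):
    """Leading part of the factor for V-fold cross-validation (C = V - 1/2).

    Assumption: the O(1/n) correction is dropped, leaving 1 + kappa^2 / (V - 1).
    """
    kappa = 1.0 + 1.0 / (2.0 * (V - 1))
    return 1.0 + kappa ** 2 / (V - 1)

# Tabulate both factors for a few values of V.
table = {V: (f_pen(V), f_cv(V)) for V in (2, 5, 10, 20, 50)}
for V, (fp, fc) in table.items():
    print(f"V = {V:2d}   f_pen = {fp:.4f}   f_cv = {fc:.4f}")
```

Under these assumptions, both factors are already within about 12.5% of their limiting value 1 at V = 10, which is the quantitative content of the advice to take V = 5 or 10.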

The conclusion of these informal arguments is confirmed by the simulation study of Section 5,
see in particular Section 5.4.



Another interesting feature of this informal argument is that the parameter V appears in the
first-order term in the variance of the increments $\Delta_{\mathrm{VFCV}}(m, m')$. Most of the
existing results focused on $\operatorname{var}(\operatorname{crit}_{\mathrm{VFCV}}(m))$. Burman
[Bur89] obtained asymptotic estimates of $\operatorname{var}(\operatorname{crit}_{\mathrm{VFCV}}(m))$
in a regression framework. Celisse [Cel08, Cel12] and Celisse and Robin [CR08] computed exactly
$\operatorname{var}(\operatorname{crit}_{\mathrm{VFCV}}(m))$ and
$\operatorname{var}(\operatorname{crit}_{\mathrm{LPO}}(m, p))$ in the least-squares regression
framework with projection estimators. These variances do not show clearly the influence of the
parameter V since it only appears in second-order terms, of order $O(n^{-2})$, in
$\operatorname{var}(\operatorname{crit}_{\mathrm{VFCV}}(m))$. Actually, Eq. (14) shows that, in the
histogram case,



(18) var (critvFCv( m -; B) ) 



1 4 / 1 / 1 

- + ^r 1 + - — t + [ - 



n n z 



V-l 



n 



V-l 



1 + 



varp { s m ) 
2 1 



1 



(3 (Am) 



and Eq. (15) becomes



. l + 0(l/n) . , 2 
(19) var ( crit cor r,VFCv(ra; B) ) = varp ( s m ) + —r 



V V(V-l) n 
1 + - - p{A m ) . 



V-l n 
Figure 1. The two densities considered. Left: setting L. Right: setting S.

5. Simulation study 

This section illustrates the main theoretical results of the paper with some experiments on 
synthetic data. 

5.1. Setting. In this section, we consider $\mathcal{X} = [0, 1]$ and $\mu$ is the Lebesgue measure
on $\mathcal{X}$. Two examples are considered for the target density $s$ and for the collection of
models $(S_m)_{m \in \mathcal{M}}$.
Two density functions $s$ are considered, see Figure 1:

• Setting L: s(x) = ^l < x<1/3 + ( 1 + f ) 1 i>x>i/3 • 

• Setting S: $s$ is the mixture of the piecewise linear density $s_0(x) = (8x - 4) 1_{1 \ge x \ge 1/2}$
(with weight 0.8) and four truncated Gaussians with means $(k/10)_{k = 1, \dots, 4}$ and common
standard deviation 1/60 (each with weight 0.05).
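
For readers who wish to reproduce experiments in setting S, the following Python sketch draws a sample from that mixture. It is not taken from the paper: the truncation interval $[0, 1]$ for the Gaussian components, the rejection step used to implement it, and the function name `sample_setting_s` are assumptions. The linear component is sampled by inverting its distribution function $(2x - 1)^2$ on $[1/2, 1]$.

```python
import math
import random

def sample_setting_s(n, rng):
    """Draw n points from the mixture of setting S (assumed truncation to [0, 1])."""
    out = []
    for _ in range(n):
        if rng.random() < 0.8:
            # s_0(x) = 8x - 4 on [1/2, 1] has CDF (2x - 1)^2: invert it.
            out.append(0.5 * (1.0 + math.sqrt(rng.random())))
        else:
            k = rng.randint(1, 4)  # each bump has weight 0.05 = 0.2 / 4
            while True:            # rejection step implements the truncation
                x = rng.gauss(k / 10.0, 1.0 / 60.0)
                if 0.0 <= x <= 1.0:
                    break
            out.append(x)
    return out

rng = random.Random(3)
xs = sample_setting_s(20000, rng)
share_upper = sum(x >= 0.5 for x in xs) / len(xs)  # should be close to 0.8
```

With a standard deviation of 1/60, the truncation to $[0, 1]$ almost never rejects a draw, so the exact truncation interval has little practical effect here.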

Two collections of models are considered, both leading to histogram estimators: for every
$m \in \mathcal{M}$, $S_m$ is the set of piecewise constant functions on some partition $\Lambda_m$
of $\mathcal{X}$.

• "Regu" for regular histograms: $\mathcal{M} = \{1, \dots, n\}$ where for every $m \in \mathcal{M}$,
$\Lambda_m$ is the regular partition of $[0, 1]$ into $m$ bins.

• "Dya2" for dyadic regular histograms with two bin sizes and a variable change-point:
$\mathcal{M} = \bigcup_{k \in \{1, \dots, \tilde n\}} \{k\} \times \{0, \dots, \lfloor \log_2(k) \rfloor\} \times \{0, \dots, \lfloor \log_2(\tilde n - k) \rfloor\}$
where $\tilde n = \lfloor n / \ln(n) \rfloor$ and for every $(k, i, j) \in \mathcal{M}$,
$\Lambda_{(k,i,j)}$ is the union of the regular partition of $[0, k/\tilde n)$ into $2^i$ pieces and
the regular partition of $[k/\tilde n, 1]$ into $2^j$ pieces.

The difference between "Regu" and "Dya2" can be visualized on Figure 2, where the corresponding
oracle models have been plotted in setting S. While "Regu" is one of the simplest and
most classical collections for density estimation, the flexibility of "Dya2" makes it possible to
adapt to the variability of the smoothness of $s$. Intuitively, in settings L and S, the optimal bin
size is smaller on [0, 1/2] (where $s$ varies quickly) than on [1/2, 1] (where the variations of $s$
are much smaller).

Another point of comparison of Regu and Dya2 is given by Table 1, which reports the values of
the quadratic risks obtained depending on the collection of models considered. Table 1 shows
that in settings L and S, the collection Dya2 helps reduce the quadratic risk by approximately
20% (when comparing the best data-driven procedures of our experiment), and even more when
comparing oracle estimators (30% in setting S, 59% in setting L). Therefore, in settings L and S, it
is worth considering more complex collections of models (such as Dya2) than regular histograms.

Figure 2. Oracle model for one sample, in setting S. Left: Regu. Right: Dya2.

Table 1. Comparison of Regu and Dya2: quadratic risks $\mathbb{E}[\|s - \hat{s}_{\hat{m}}\|^2]$ of "Oracle" and "Best" estimators (multiplied by $10^3$) with the two collections of models. "Best" means that $\hat{m}$ is the data-driven procedure minimizing $\mathbb{E}[\|s - \hat{s}_{\hat{m}}\|^2]$ among the data-driven procedures appearing in Table 3. "Oracle" means that $\hat{m} \in \operatorname{argmin}_{m \in \mathcal{M}_n} \|s - \hat{s}_m\|^2$ is the oracle model for each sample.

    Setting   Oracle(Regu)   Oracle(Dya2)   Best(Regu)     Best(Dya2)
    L         13.3 ± 0.2     5.49 ± 0.06    25.5 ± 0.3     19.8 ± 0.3
    S         62.7 ± 0.4     43.9 ± 0.3     101.0 ± 0.8    83.7 ± 0.7

Let us finally remark that Dya2 does not reduce the quadratic risk in all settings as significantly 
as in settings L and S. We performed similar experiments with a few other density functions, 
sometimes leading to less important differences between Regu and Dya2 in terms of risk (results 
not shown). The oracle model was always better with Dya2, but in two cases, the risk of the best 
data-driven procedure with Dya2 was larger than with Regu by 6 to 8%. 

5.2. Procedures compared. In each setting, we considered the following model selection procedures:

• Mallows' $C_p$: penalization with $\mathrm{pen}(m) = 2 d_m / n$, where $d_m = \mathrm{Card}(\Lambda_m)$ denotes the number of bins.

• V-fold cross-validation with $V \in \{2, 5, 10\}$ and the leave-one-out (that is, V-fold cross-validation with V = n), see Section 2.3.

• V-fold penalties (with C = V − 1), for $V \in \{2, 5, 10\}$ and the leave-one-out penalty (that is, the V-fold penalty with V = n), see Section 2.3.

• for comparison, the expectation of the ideal penalty $\mathbb{E}[\mathrm{pen}_{\mathrm{id}}(m)]$.



Since it is often suggested to multiply the usual penalties by some factor larger than one [Arl08], we considered all penalties above multiplied by a factor chosen among {1, 1.25, 1.5, 2}. Then, in every setting, the risks of the estimators selected with $\mathbb{E}[\mathrm{pen}_{\mathrm{id}}(m)]$ gave us the optimal factor C* by which all penalties should be multiplied. In Table 2, we only kept results corresponding to each penalty multiplied by C*, and we reported the value of C*. Complete results can be found in the appendix (Table 3).

Table 2. Estimated model selection performances, see text. Performances of the best data-driven procedures (that is, the best one and all procedures not significantly worse) are bolded. All penalties are multiplied by the factor C*, which is chosen according to results obtained with $\mathbb{E}[\mathrm{pen}_{\mathrm{id}}]$, see Section 5.2.

    Procedure           L-Dya2         S-Dya2
    C_p                 4.38 ± 0.09    3.01 ± 0.04
    pen2F               5.12 ± 0.12    2.10 ± 0.02
    pen5F               3.80 ± 0.07    1.95 ± 0.02
    pen10F              3.66 ± 0.06    1.91 ± 0.02
    penLOO              3.61 ± 0.06    1.91 ± 0.02
    2FCV                6.41 ± 0.16    2.10 ± 0.02
    5FCV                6.27 ± 0.16    2.09 ± 0.03
    10FCV               6.25 ± 0.16    2.07 ± 0.03
    LOO                 6.41 ± 0.18    2.08 ± 0.03
    C* × E[pen_id]      3.66 ± 0.06    1.93 ± 0.02
    C*                  2              1.5



5.3. Model selection performances. In each setting, all procedures have been compared on N = 1000 independent synthetic data sets of size n = 500. For measuring their respective model selection performances, we estimated for each procedure $\hat{m}(\cdot)$

$$C_{\mathrm{or}} := \mathbb{E}\left[ \frac{\|s - \hat{s}_{\hat{m}}\|^2}{\inf_{m \in \mathcal{M}_n} \|s - \hat{s}_m\|^2} \right],$$

which represents the constant that would appear in front of an oracle inequality. The uncertainty of estimation of $C_{\mathrm{or}}$ is measured by the empirical standard deviation of $\|s - \hat{s}_{\hat{m}}\|^2 / \inf_{m \in \mathcal{M}_n} \|s - \hat{s}_m\|^2$ divided by $\sqrt{N}$. The results are reported in Table 2 for settings L and S, with the collection Dya2. Results for Regu are not reported here since $C_p$ is already known to work well with Regu, see [Ler12b], so V-fold methods would not significantly improve its performance, and would have a larger computational cost. Complete results (including Regu) are given in Table 3 in the appendix, showing that the performances of $C_p$ and V-fold methods are indeed very close.
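
The estimation of $C_{\mathrm{or}}$ and of its uncertainty can be sketched as follows. This is only an illustration: the function name and the two input lists (the loss of the selected estimator and the loss of the oracle on each of the N simulated samples) are hypothetical, and we assume the "empirical standard deviation" is the usual sample standard deviation.

```python
import math
import statistics


def estimate_c_or(selected_losses, oracle_losses):
    """Monte Carlo estimate of C_or and its standard error.

    selected_losses[r]: ||s - s_hat_{m_hat}||^2 on the r-th simulated sample.
    oracle_losses[r]:   inf_m ||s - s_hat_m||^2 on the same sample.
    """
    ratios = [sel / ora for sel, ora in zip(selected_losses, oracle_losses)]
    c_or = statistics.fmean(ratios)
    # assumption: sample standard deviation (n - 1 denominator), divided by sqrt(N)
    std_err = statistics.stdev(ratios) / math.sqrt(len(ratios))
    return c_or, std_err


print(estimate_c_or([2.0, 3.0, 4.0], [1.0, 2.0, 4.0]))
```

Note that the ratio is averaged sample by sample, which is different from dividing the average loss of the selected estimator by the average oracle loss.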

Performance as a function of V. Let us first consider V-fold penalization. In both settings L and S, as suggested by our theoretical results, $C_{\mathrm{or}}$ decreases when V increases. The improvement is large when V goes from 2 to 5 (26% for L, 7% for S), small but significant when V goes from 5 to 10 (4% for L, 2% for S), and not significant when V goes from 10 to n = 500. Since the main influence of V is on the variance of the V-fold penalty, these experiments confirm our interpretation of Theorem 2 in Section 4: increasing V helps much more from 2 to 5 or 10 than from 10 to n.
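
The percentages quoted above can be recomputed from the V-fold penalization rows of Table 2 (pen2F, pen5F, pen10F). The short sketch below does this arithmetic; the dictionary layout and the helper name are ours.

```python
def relative_improvement(before, after):
    """Relative decrease (in percent) when going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Values of C_or read from Table 2 for V-fold penalization (collection Dya2).
pen_vf = {
    "L": {2: 5.12, 5: 3.80, 10: 3.66},
    "S": {2: 2.10, 5: 1.95, 10: 1.91},
}

for setting, values in pen_vf.items():
    from_2_to_5 = relative_improvement(values[2], values[5])
    from_5_to_10 = relative_improvement(values[5], values[10])
    print(setting, round(from_2_to_5), round(from_5_to_10))
```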

The picture is less clear for V-fold cross-validation, for which no significant difference is observed among $V \in \{2, 5, 10, n\}$, and $C_{\mathrm{or}}$ is minimized for $V \in \{5, 10\}$. Indeed, as explained in a previous work in regression [Arl08], increasing V simultaneously decreases the bias and the variance of the V-fold cross-validation criterion, leading to various possible behaviours of $C_{\mathrm{or}}$ as a function of V depending on the setting.

Other comments. Table 2 confirms in the least-squares density estimation framework several facts previously observed in least-squares regression [Arl08]:

• $C_p$ performs much worse than V-fold penalization (except V = 2 in setting L) with the collection Dya2. On the contrary, $C_p$ does well with Regu (see Table 3 in the appendix), but V-fold penalization then performs as well.

• V-fold cross-validation performs significantly worse than V-fold penalization (except in setting S with V = 2, where 2-fold cross-validation coincides with 2-fold penalization multiplied by 1.5, as shown in Section 2.4). Nevertheless, making a bad choice for C* (which depends on the setting) can lead to worse performance with V-fold penalization, especially when V = 2 (see Table 3 in the appendix).

In other settings considered in a preliminary phase of our experiments, differences between V = 2 and V = 5 were sometimes smaller or not significant, but always with the same ordering (that is, the worst performance for V = 2). In a few settings, for which the "change-point" in the smoothness of $s$ was close to the median of $s \, d\mu$, we found $C_p$ among the best procedures with collection Dya2; then, V-fold penalization and cross-validation always had a performance very close to $C_p$. Both phenomena led us to discard all settings for which there was no significant difference to comment on.

5.4. Variance as a function of V. We now focus on illustrating the theoretical results of Section 4 about the variance of V-fold penalization and its influence on model selection. Let us go back to the informal arguments at the beginning of Section 4, in order to understand precisely the role of deviations of $\Delta(m, m', \mathcal{B}, C)$ in the corresponding model selection procedure. For the sake of simplicity, we focus on the unbiased case (C = V − 1) in this subsection.

By definition (9) of $\hat{m}(\mathcal{B}, V - 1)$, a model $m \in \mathcal{M}_n$ can be selected if and only if for all $m' \in \mathcal{M}_n$,

$$P_n \gamma(\hat{s}_m) + \mathrm{pen}_{\mathrm{VF}}(m, \mathcal{B}, V - 1) \le P_n \gamma(\hat{s}_{m'}) + \mathrm{pen}_{\mathrm{VF}}(m', \mathcal{B}, V - 1),$$

which is equivalent to

$$\mathrm{crit}_{\mathrm{id}}(m) - \mathrm{crit}_{\mathrm{id}}(m') \le n^{-1/2} \Delta(m', m, \mathcal{B}, V - 1).$$

The left-hand side is of order $\mathbb{E}[\mathrm{crit}_{\mathrm{id}}(m) - \mathrm{crit}_{\mathrm{id}}(m')]$, whereas the right-hand side, which is centered, is at most of order $n^{-1/2} \sqrt{\operatorname{var}(\Delta(m', m, \mathcal{B}, V - 1))}$. Moreover, when selecting $m$ instead of the oracle, the risk increase is of order $\mathbb{E}[\mathrm{crit}_{\mathrm{id}}(m) - \inf_{m' \in \mathcal{M}_n} \mathrm{crit}_{\mathrm{id}}(m')]$.

Therefore, the influence of V (through the variance of the criterion) on the model selection performance can be visualized by plotting the difference $\mathbb{E}[\mathrm{crit}_{\mathrm{id}}(m) - \mathrm{crit}_{\mathrm{id}}(m')] - n^{-1/2} \sqrt{\operatorname{var}(\Delta(m', m, \mathcal{B}, V - 1))}$ for all $m \in \mathcal{M}_n$ and, say, $m' = m^\star \in \operatorname{argmin}_{m \in \mathcal{M}_n} \mathbb{E}[\mathrm{crit}_{\mathrm{id}}(m)]$ the oracle model. When this quantity is negative for some $m$, it means the corresponding procedure can lose up to $\mathbb{E}[\mathrm{crit}_{\mathrm{id}}(m) - \mathrm{crit}_{\mathrm{id}}(m')]$ in terms of risk. So, the smaller the set of such $m \in \mathcal{M}_n$ is, the better the procedure should be. Such information is provided on Figure 3.

An alternative visualization of the same phenomenon is to determine the set

$$(20) \qquad \mathcal{M}^{\mathrm{sel}}(V) := \left\{ m \in \mathcal{M}_n \ \text{s.t.} \ \forall m' \in \mathcal{M}_n, \ \mathbb{E}[\mathrm{crit}_{\mathrm{id}}(m) - \mathrm{crit}_{\mathrm{id}}(m')] \le \frac{\sqrt{\operatorname{var}(\Delta(m', m, \mathcal{B}, V - 1))}}{\sqrt{n}} \right\},$$

which can be interpreted as "the set of models that could be selected by penalization with $\mathrm{pen}_{\mathrm{VF}}(\cdot, \mathcal{B}, V - 1)$", according to the above informal argument. The smaller this set is, the better the procedure should be. Such information is provided on the right of Figure 3.
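
As an illustration of definition (20), the following sketch computes the set of "selectable" models from two hypothetical inputs: the expected ideal criterion of each model and the variance of $\Delta(m', m, \mathcal{B}, V - 1)$ for each pair. It assumes the threshold in (20) is $\sqrt{\operatorname{var}(\Delta)} / \sqrt{n}$; all names and the toy numbers are ours, not the paper's.

```python
import math


def selectable_models(expected_crit, var_delta, n):
    """Models m such that, for all m', E[crit(m) - crit(m')] <= sqrt(var_delta / n).

    expected_crit: dict m -> E[crit_id(m)] (hypothetical input).
    var_delta: dict (m_prime, m) -> var(Delta(m', m, B, V - 1)); missing pairs
               (for instance m' = m) are treated as zero variance.
    """
    models = list(expected_crit)
    selected = []
    for m in models:
        if all(
            expected_crit[m] - expected_crit[mp]
            <= math.sqrt(var_delta.get((mp, m), 0.0) / n)
            for mp in models
        ):
            selected.append(m)
    return selected


crit = {1: 1.0, 2: 0.5, 3: 0.9}
pairs = [(a, b) for a in crit for b in crit if a != b]
small_var = {pair: 4.0 for pair in pairs}    # threshold 0.2
large_var = {pair: 36.0 for pair in pairs}   # threshold 0.6
print(selectable_models(crit, small_var, n=100))
print(selectable_models(crit, large_var, n=100))
```

In this toy example, a smaller variance leaves only the model minimizing the expected criterion, while a larger variance makes every model selectable, which is the mechanism described in the text.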

Figure 3. Visualization of standard-deviations as a function of V in experiment S-Regu, see text. Left: global picture ($1 \le m \le 500$). Right: zoom on the left part; the colored dots represent the set of "selectable models" $\mathcal{M}^{\mathrm{sel}}(V)$ for $V \in \{2, 5, 10, n\}$, with the same color code as the colored lines.

More precisely, Figure 3 considers the setting S with a sample size n = 500 and the collection Regu (for which models are naturally indexed by their dimension). On the left part, $\mathbb{E}[\mathrm{crit}_{\mathrm{id}}(m) - \mathrm{crit}_{\mathrm{id}}(m')] - n^{-1/2} \sqrt{\operatorname{var}(\Delta(m', m, \mathcal{B}, V - 1))}$ is plotted as a function of the dimension $d_m$ for $V \in \{2, 5, 10, n\}$, as well as $\mathbb{E}[\mathrm{crit}_{\mathrm{id}}(m) - \mathrm{crit}_{\mathrm{id}}(m')]$ (black line). Figure 3 confirms the result of Section 4: the term $\sqrt{\operatorname{var}(\Delta(m', m, \mathcal{B}, V - 1))}$ decreases when V increases, which can explain the improvement of the model selection procedure performance observed in Table 2. More precisely, the standard-deviation term decreases much more from V = 2 to V = 5 than from V = 5 to V = 10 or n. Even when zooming on the graph (right of Figure 3), V = 10 and V = n are hard to distinguish. On the contrary, going from V = 2 to $V \ge 5$ seems to reduce the standard-deviation of $\Delta(m^\star, m, \mathcal{B}, V - 1)$ by a factor $\kappa > 1$ for all $m$, which confirms the informal arguments provided
at the end of Section [U in setting S-Regu with n = 500, when m' = m*, it seems 

var (A(m',m,H, V- 1)) « yK + 

for some constants K,K' > 0. 

On the right part of Figure 3, the colored dots represent the set $\mathcal{M}^{\mathrm{sel}}(V)$ for V = 2, 5, 10 and n (from top to bottom). As expected from previous results, $\mathcal{M}^{\mathrm{sel}}(V)$ is a decreasing function of V, and the difference between V = 2 and V = 5 is much larger than between V = 5 and $V \ge 10$. Reducing $\mathcal{M}^{\mathrm{sel}}(V)$ then can explain how increasing V makes the excess risk of the selected model smaller, as observed in Table 2: with fewer "selectable" models, the selected model will be more likely to be close to the oracle (in terms of risk). This phenomenon explains in part why it can be sufficient to take V = 5 or V = 10 in practice.



dm. V dm.' 



n 



6. Fast algorithm for computing V-fold penalties for least-squares density estimation

Since the use of V-fold algorithms is motivated by computational reasons, it is important to discuss the actual computational cost of V-fold penalization and cross-validation as a function of V. In the least-squares density estimation framework, two approaches are possible: a naive one (valid for all frameworks) and a faster one (specific to least-squares density estimation). For clarifying the exposition, we assume in this section that (H5*) holds true (so, V divides n). The general algorithm for computing the V-fold penalized criterion and/or the V-fold cross-validation criterion consists in training the estimator with data sets $(\xi_i)_{i \notin B_j}$ for $j = 1, \ldots, V$ and then testing each trained estimator on the data sets $(\xi_i)_{i \in B_j}$ and $(\xi_i)_{i \notin B_j}$. In the least-squares density estimation framework, for any model $S_m$ given through an orthogonal family $(\psi_\lambda)_{\lambda \in \Lambda_m}$ of $d_m$ elements of $L^2(\mu)$, we get the "naive" algorithm described and analysed more precisely in Section I.3.1 in the appendix, whose complexity is of order $n V d_m$.

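Several simplifications occur in the least-squares density estimation framework, which make it possible to avoid a significant part of the computations made in the naive algorithm. This leads to the following algorithm.

Algorithm 1.

Input: $\mathcal{B}$ some partition of $\{1, \ldots, n\}$ satisfying (H5*), $\xi_1, \ldots, \xi_n \in \mathcal{X}$ and $(\psi_\lambda)_{\lambda \in \Lambda_m}$ a finite orthonormal family of $L^2(\mu)$, with $\mathrm{Card}(\Lambda_m) = d_m$.

(1) For $i \in \{1, \ldots, V\}$ and $\lambda \in \Lambda_m$, compute $A_{i,\lambda} := \frac{V}{n} \sum_{j \in B_i} \psi_\lambda(\xi_j)$.

(2) For $i, j \in \{1, \ldots, V\}$, compute $C_{i,j} := \sum_{\lambda \in \Lambda_m} A_{i,\lambda} A_{j,\lambda}$.

(3) Compute $S = \sum_{1 \le i, j \le V} C_{i,j}$ and $T = \operatorname{tr}(C)$.

Output:

Empirical risk: $P_n \gamma(\hat{s}_m) = -S / V^2$.

V-fold cross-validation criterion: $\mathrm{crit}_{\mathrm{VFCV}}(m) = \frac{T}{V(V-1)} - \frac{S - T}{(V-1)^2}$.

V-fold penalty: $\mathrm{pen}_{\mathrm{VF}}(m) = \left( \mathrm{crit}_{\mathrm{VFCV}}(m) - P_n \gamma(\hat{s}_m) \right) \frac{C}{V - 1/2}$.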

To the best of our knowledge, Algorithm 1 is new, even for computing the V-fold cross-validation criterion. Its correctness and complexity are analyzed in the following proposition.
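
The three steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: we assume the least-squares contrast $\gamma(t, \xi) = \|t\|^2 - 2 t(\xi)$, an orthonormal family, equal-size blocks, and the output formulas $-S/V^2$, $T/(V(V-1)) - (S-T)/(V-1)^2$ and $C/(V - 1/2)$ as we read them; the function names and the histogram basis helper are ours. A naive V-fold cross-validation routine is included for comparison.

```python
import math
import random


def histogram_basis(d):
    """Orthonormal histogram basis on [0, 1]: psi_l = sqrt(d) * indicator(bin l)."""
    def make_psi(l):
        return lambda x: math.sqrt(d) if min(int(x * d), d - 1) == l else 0.0
    return [make_psi(l) for l in range(d)]


def fast_vfold(xs, blocks, basis, C):
    """Sketch of Algorithm 1: returns (empirical risk, V-fold CV criterion, V-fold penalty)."""
    V = len(blocks)
    # Step 1: block averages A[i][l] = (V/n) * sum_{j in B_i} psi_l(xs[j])
    A = [[sum(psi(xs[j]) for j in block) / len(block) for psi in basis]
         for block in blocks]
    # Step 2: V x V Gram matrix of the block averages
    gram = [[sum(a * b for a, b in zip(A[i], A[j])) for j in range(V)]
            for i in range(V)]
    # Step 3: total sum and trace
    S = sum(sum(row) for row in gram)
    T = sum(gram[i][i] for i in range(V))
    emp_risk = -S / V ** 2
    crit_vfcv = T / (V * (V - 1)) - (S - T) / (V - 1) ** 2
    pen_vf = (crit_vfcv - emp_risk) * C / (V - 0.5)
    return emp_risk, crit_vfcv, pen_vf


def naive_vfcv(xs, blocks, basis):
    """Naive V-fold cross-validation: retrain on each training set, test on the left-out block."""
    V = len(blocks)
    total = 0.0
    for K in range(V):
        train = [j for b in range(V) if b != K for j in blocks[b]]
        coef = [sum(psi(xs[j]) for j in train) / len(train) for psi in basis]
        sq_norm = sum(c * c for c in coef)
        test_mean = sum(c * psi(xs[j]) for j in blocks[K]
                        for c, psi in zip(coef, basis)) / len(blocks[K])
        total += sq_norm - 2.0 * test_mean
    return total / V


rng = random.Random(0)
xs = [rng.random() for _ in range(12)]
blocks = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
basis = histogram_basis(4)
print(fast_vfold(xs, blocks, basis, C=2.0))
print(naive_vfcv(xs, blocks, basis))
```

The fast version evaluates each basis function once per data point and then works only with the V × V matrix, whereas the naive version recomputes the training coefficients for every fold. For histograms, step 1 could be made cheaper still, since each point falls into a single bin.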

Proposition 2. Algorithm 1 is correct and has a computational complexity of order $(n + V^2) d_m$.

In the histogram case, that is, when $\Lambda_m$ is a partition of $\mathcal{X}$ and $\forall \lambda \in \Lambda_m$, $\psi_\lambda = |\lambda|^{-1/2} \mathbf{1}_\lambda$, the computational complexity of Algorithm 1 can be reduced to the order of $n + V^2 d_m$.

Proposition 2 is proved in Section I.3.2 in the supplementary material.

7. Discussion 

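7.1. Hold-out penalization. Similarly to all previous results on V-fold methods, we can analyze the hold-out methods where data are only split once.

7.1.1. Definition. First, we recall the definition of the hold-out criterion. Given $T \subset \{1, \ldots, n\}$, we train the estimators $\hat{s}_m$ with the data set $(\xi_i)_{i \in T}$ and estimate their risks with the remaining data set $(\xi_i)_{i \in T^c}$, which gives the hold-out criterion

$$\mathrm{crit}_{\mathrm{HO}}(m, T) = -2 P_n^{(T^c)}\left( \hat{s}_m^{(T)} \right) + \left\| \hat{s}_m^{(T)} \right\|^2 = P_n^{(T^c)} \gamma\left( \hat{s}_m^{(T)} \right).$$

Similarly, the hold-out penalty is defined as the hold-out estimator of $\mathrm{pen}_{\mathrm{id}}(m)$, that is,

$$(21) \qquad \mathrm{pen}_{\mathrm{HO}}(m, T, C) = 2C \left( P_n^{(T)} - P_n^{(T^c)} \right) \left( \hat{s}_m^{(T)} - \hat{s}_m^{(T^c)} \right),$$

and the hold-out penalization estimator is defined by $\tilde{s}_{\mathrm{HO}} = \hat{s}_{\hat{m}_{\mathrm{HO}}}$ where

$$(22) \qquad \hat{m}_{\mathrm{HO}} = \hat{m}_{\mathrm{HO}}(T, C) = \operatorname{argmin}_{m \in \mathcal{M}_n} \left\{ P_n \gamma(\hat{s}_m) + \mathrm{pen}_{\mathrm{HO}}(m, T, C) \right\}.$$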

7.1.2. Oracle inequality. Similarly to Theorem 1, we prove in Section I.1.1 the following oracle inequality for hold-out penalization estimators.

Theorem 3. Let $\xi_{1:n}$ be i.i.d. real valued random variables with density $s \in L^2(\mu)$. Let $(S_m)_{m \in \mathcal{M}_n}$ be a collection of linear spaces satisfying Assumptions (H1), (H2), (H3), (H4). Let $C > 0$, $T \subset \{1, \ldots, n\}$ be a training set with $n_t = \mathrm{Card}(T)$, $n_v = n - n_t$, let

l2 run „ / ho - \ , m villi 



-HO_ c ^ ; ,5 HO = 2( K HO -l) , and eM V 



16 



Let sho be the hold-out penalization estimator defined by Eq. (|22p . An absolute constant L > 
exists such that, with probability at least 1 — n~ 2 , 

1 - go _ l{k ho y 1)£ f ) _ 

1 + 8%° + L(k ho V l)ef } me ^- 

Let us make a few comments. 

• Let V be a divisor of n and let n v = n/V, nt = n — n v , C = ntn v /(2n 2 ) so that k = 1 
and <5 HO = 0. Theorems Q] and [3] show the stabilization effect of V-fold procedures. For 
large V, is of order (n~ 1 V(ln n) 2 ) 1 / 4 whereas e in Theorem Q] remains of the correct 
order ( i?* 4 \/lnn. 

• When V = 2, it is easy to check on formula (21) that the hold-out penalty $\mathrm{pen}_{\mathrm{HO}}^{T^c}$ built with $\{1, \ldots, n\} \setminus T$ is exactly the same as $\mathrm{pen}_{\mathrm{HO}}^{T}$ built with $T$. Hence, the 2-fold cross validation penalty $\mathrm{pen}_{\mathrm{2F}}$ is equal to $(\mathrm{pen}_{\mathrm{HO}}^{T} + \mathrm{pen}_{\mathrm{HO}}^{T^c})/2 = \mathrm{pen}_{\mathrm{HO}}^{T}$. This proves the logarithmic loss in the rate $\varepsilon(n, V)$ in Theorem 1 is only due to technical reasons.

• Similarly to Theorem 2, the variance terms can be computed for the hold-out penalty in order to understand separately the roles of the training sample size and of averaging over the V splits, in the V-fold criteria. See Proposition 19 in Section I.1.2 for details.

7.2. Other model selection procedures for density estimation. Least-squares density estimation is a classical problem of non-parametric statistics and several model selection procedures have been studied in this framework. Oracle inequalities can be derived, for example, for $\ell_1$ penalization methods [BTW07], aggregation procedures [RT07], the blockwise Stein method [Rig06] or using T-estimators [Bir08]. To our knowledge, none of these methods yield oracle inequalities without remainder terms and with a leading constant asymptotically equal to one at the level of generality presented in this paper. For example, our results are valid for data taking values in any metric space and the models can be of infinite dimension. The results of [Bir08] hold for infinite dimensional models but the estimators are not computable in practice. Let us mention here that [BR06] proposed a precise evaluation of the penalty term in the case of regular histograms. Their final penalty is a modification of $C_p$, performing very well on regular histograms. These performances are likely to become much worse on the collection Dya2 presented in Section 5. This can be seen, for example, in Table 3 where we presented the performances of $C_p$ with different over-penalizing constants.

7.3. Conclusion on the choice of V. Overall, choosing V requires a trade-off between:

• Computational complexity, usually proportional to V, sometimes smaller, see Section 6.

• Statistical performance in terms of risk, which is better when the bias and the variance are small. The bias decreases as V increases for V-fold cross-validation, but it can be removed completely or fixed to any desired value by using V-fold penalization instead, see Section 2. The variance decreases as V increases, but it almost reaches its minimal value by taking, say, V = 5 or V = 10, as shown by theoretical and empirical arguments in Sections 4 and 5.

The most common advice for choosing V in the literature (see for instance [HTF09, Section 7.10.1]) is to take V between 5 and 10. This article provides clear evidence why taking V larger does not reduce the variance significantly. Concerning the bias, Lemma 1 shows 5-fold (resp. 10-fold) cross-validation corresponds to overpenalization by a factor 1 + 1/8 (resp. 1 + 1/18), which is likely to be a good amount in many cases; in our simulation experiments, the best overpenalization factor was even larger, see also [Arl08].

Note however that our results are only valid for some least-squares algorithms, and it is reported in the literature [AC10] that V-fold cross-validation behaves differently as a function of V in other settings.

Finally, we would like to address the question of choosing between V-fold cross-validation and penalization. The answer is rather simple (at least in least-squares density estimation), since Lemma 1 shows V-fold cross-validation is a particular instance of V-fold penalization, with C = V − 1/2. So, if one wants to overpenalize by a factor (V − 1/2)/(V − 1), V-fold cross-validation is definitely the good choice. Otherwise, the best choice would be V-fold penalization with another value for C, depending on how much one wants to overpenalize.
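
The overpenalization factors quoted above follow from the ratio between C = V − 1/2 (V-fold cross-validation) and C = V − 1 (unbiased V-fold penalization). A small sketch in exact arithmetic, with a helper name of our own choosing:

```python
from fractions import Fraction


def vfcv_overpenalization_factor(V):
    """Ratio (V - 1/2) / (V - 1), that is, 1 + 1 / (2 (V - 1))."""
    return (V - Fraction(1, 2)) / (V - 1)


for V in (2, 5, 10, 500):
    print(V, vfcv_overpenalization_factor(V))
```

The factor decreases to 1 as V grows, which is another way to see that the bias of V-fold cross-validation vanishes for large V.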

Acknowledgments. The authors acknowledge the support of the French Agence Nationale de la 
Recherche (ANR) under reference ANR-09-JCJC-0027-01 (Detect project). 
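
Appendix A. Additional notation

Throughout the appendix, $\mathcal{B} = (B_1, \ldots, B_V)$ denotes some fixed partition of $\{1, \ldots, n\}$ satisfying (H5*). We refer to Section I.5 in the supplementary material for the general case. For every $m \in \mathcal{M}_n$, let $(\psi_\lambda)_{\lambda \in \Lambda_m}$ be an orthonormal basis of $S_m \subset L^2(\mu)$, and

$$U_m := \frac{1}{n^2} \sum_{1 \le i \ne j \le n} \sum_{\lambda \in \Lambda_m} (\psi_\lambda(\xi_i) - P\psi_\lambda)(\psi_\lambda(\xi_j) - P\psi_\lambda),$$

$$U_{\mathcal{B},m} := \frac{1}{n^2} \sum_{K=1}^{V} \sum_{i \ne j \in B_K} \sum_{\lambda \in \Lambda_m} (\psi_\lambda(\xi_i) - P\psi_\lambda)(\psi_\lambda(\xi_j) - P\psi_\lambda).$$

Finally, for all $m, m' \in \mathcal{M}_n$, we define

$$\mathbb{B}_{m,m'} := \{ t \in S_m + S_{m'}, \ \|t\| \le 1 \}, \qquad \mathbb{B}_m := \mathbb{B}_{m,m},$$

$$v_{m,m'}^2 := \sup_{t \in \mathbb{B}_{m,m'}} P[(t - Pt)^2], \qquad v_m := v_{m,m},$$

$$b_{m,m'} := \sup_{t \in \mathbb{B}_{m,m'}} \|t\|_\infty, \qquad b_m := b_{m,m},$$

$$D_m := P\left( \sup_{t \in \mathbb{B}_m} (t - Pt)^2 \right) = n \mathbb{E}\left[ \|\hat{s}_m - s_m\|^2 \right], \qquad R_m := n \|s - s_m\|^2 + D_m = n \mathbb{E}\left[ \|s - \hat{s}_m\|^2 \right].$$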

Appendix B. New results on the V-fold penalty

This section gathers two results on the V-fold penalty that can be of independent interest: an exact formula (Proposition 3) and a concentration inequality (Proposition 4). These two propositions play a key role in the proof of our main results.

B.1. Exact formula.

Proposition 3. Let $V \ge 2$, $n \ge 4$ and $\mathcal{B}$ satisfying (H5*). For every $m \in \mathcal{M}_n$, with the notation of Section A,

$$(23) \qquad \frac{V-1}{2C} \, \mathrm{pen}_{\mathrm{VF}}(m) = \|\hat{s}_m - s_m\|^2 - \frac{V}{V-1} \left( U_m - U_{\mathcal{B},m} \right).$$

Proposition 3 is proved in Section D. Note that $2\|\hat{s}_m - s_m\|^2$ is the main part of the ideal penalty, as explained by Eq. (34). Therefore, Proposition 3 provides an exact formula for $\mathrm{pen}_{\mathrm{VF}}(m) - \mathrm{pen}_{\mathrm{id}}(m)$, which is crucial for proving an oracle inequality like Theorem 1.

B.2. Concentration inequality. 



Proposition 4. Let V > 2, n > 4, B satisfying (|H5*j) and S m be some model satisfying (|H1|) and 
(|H4j) with i£* replaced by R m = nE[||s — s" m || 2 ]. Let pen VF (m) = pen VF (m, B, C) be the V '-fold 
penalty on S m defined by Eq. (jl]). Let C* be the constant defined above Eq. (I35p . Let 



ei(m,V,x) = C*— r 
Rr 



X 



1 + 



+ 



X Rm. 



n 



Then, an absolute constant V exists such that, for every < x < ^ A V 1 / 6 RU 4 , 

f . n r 

(24) 



C R 

P( |pen VF (m) - E [pen VF (m)]| > L' — — -ei(m, V, %) — ) < (e z + 6)e~ 

Furthermore, for some $R_\star$ such that $0 < R_\star \le R_m$, let



1 + 



e 2 (n,V,x) = C±—^ 
R* 



VV3 



+ 



x^Rl' A 



n 



Then, an absolute constant L > exists such that, for any 2 < x < ^ A V 1 / 6 R 1 J 4 



(25) 



V - I 2 

g pen VF (m) - 2 ||s m - s m || 



> Le 2 (n, V,x) 



R„ 



n 



< (e 2 + 4)e" 



1S 



is 



Proposition 4 is proved in Section E.

Eq. (24) is a concentration inequality for the V-fold penalty around its expectation. Eq. (25) is a formulation directly useful for proving an oracle inequality like Theorem 1, since $2\|\hat{s}_m - s_m\|^2$ is the main part of the ideal penalty, as explained by Eq. (34).

Only one concentration inequality was proved for the V-fold penalty before Proposition 4, for regressograms with the least-squares loss [Arl08]. The main novelty of Proposition 4 is that it applies to all values of V, and provides smaller deviation terms when V increases. In [Arl08], it was assumed that $V \le \ln(n)$ and the deviation bounds get worse when V increases, which is highly non-intuitive.

When the variables £i :n are real valued, the conc entration of the [/-statistics U m — UB, m can be 
alternatively derived from Theorem 3.1 in [HR B03j ] and the evaluation of the terms A,B,C,D,F 
in this result obtained in the proof of Lemma 6.2 in Lerl2al |. We provide in Proposition [J] a result 
valid in general measurable spaces. This extension i s usefu l for densities on M. d or when the data 
£i :ri are only assumed to be mixing, see for example |Lerll| . 

The sharpness of the deviation terms in Proposition H] can be assessed by comparing them to 
the exact variance computations of Theorem [2] and Proposition [T5l This comparison is made in 
the case of regular histogram models — for which all terms can be simplified-by Proposition 
Section 



m 



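Appendix C. Elementary properties of least-squares density estimation

This section gathers some classical results on least-squares density estimation that will be used repeatedly in the proofs.

For all $m \in \mathcal{M}_n$ and any non-empty $A \subset \{1, \ldots, n\}$,

$$(26) \qquad \hat{s}_m^{(A)} = \operatorname{argmin}_{t \in S_m} \left\{ P_n^{(A)} \gamma(t) \right\} = \sum_{\lambda \in \Lambda_m} \left( P_n^{(A)} \psi_\lambda \right) \psi_\lambda.$$

Classical computations show that

$$(27) \qquad s_m = \operatorname{argmin}_{t \in S_m} \left\{ -2Pt + \|t\|^2 \right\} = \sum_{\lambda \in \Lambda_m} (P\psi_\lambda) \psi_\lambda,$$

so that

$$(28) \qquad \left( P_n^{(A)} - P \right)\left( \hat{s}_m^{(A)} - s_m \right) = \sum_{\lambda \in \Lambda_m} \left( \left( P_n^{(A)} - P \right) \psi_\lambda \right)^2 = \left\| \hat{s}_m^{(A)} - s_m \right\|^2$$

for every $A \subset \{1, \ldots, n\}$, by using Eq. (26).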

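Let $(\mathcal{X}, \mathcal{A}, \mu)$ be a measured space and let $S_\Lambda := \operatorname{span}(\psi_\lambda)_{\lambda \in \Lambda}$ be a linear subspace of measurable functions. Let $\Pi_\Lambda$ be the orthogonal projection on $S_\Lambda \cap L^2(\mu)$ w.r.t. the scalar product $\langle t, u \rangle = \int t u \, d\mu$. Let $\mathbb{B}_\Lambda := \{ t \in S_\Lambda \ \text{s.t.} \ \|t\| \le 1 \}$.

Lemma 5. Let $f$ be a function in $L^2(\mu)$. Then

$$(29) \qquad \sup_{t \in \mathbb{B}_\Lambda} \left( \int t f \, d\mu \right)^2 = \|\Pi_\Lambda(f)\|^2.$$

In particular, for any linear map $L : S_\Lambda \to \mathbb{R}$ such that $\sum_{\lambda \in \Lambda} (L(\psi_\lambda))^2 < \infty$, we have

$$(30) \qquad \sum_{\lambda \in \Lambda} (L(\psi_\lambda))^2 = \sup_{t \in \mathbb{B}_\Lambda} (L(t))^2.$$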

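Proof of Lemma 5. For every $t \in S_\Lambda$,

$$\int t f \, d\mu = \langle t, f \rangle_{L^2(\mu)} = \langle t, \Pi_\Lambda f \rangle_{L^2(\mu)}$$

since $\Pi_\Lambda$ is the orthogonal projection on $S_\Lambda$. By Cauchy-Schwarz inequality,

$$\forall t \in \mathbb{B}_\Lambda, \qquad \left| \int t f \, d\mu \right| \le \|t\| \, \|\Pi_\Lambda f\| \le \|\Pi_\Lambda f\|,$$

with equality when $\|\Pi_\Lambda f\| = 0$ or when $t = (\|\Pi_\Lambda f\|)^{-1} \Pi_\Lambda f$. This proves (29).

Now, $\sum_{\lambda \in \Lambda} (L(\psi_\lambda))^2$ is the square of the $\ell^2$-norm of the sequence $(L(\psi_\lambda))_{\lambda \in \Lambda}$, hence, from (29) and the linearity of $L$,

$$\sum_{\lambda \in \Lambda} (L(\psi_\lambda))^2 = \sup_{\sum_{\lambda \in \Lambda} \beta_\lambda^2 \le 1} \left( \sum_{\lambda \in \Lambda} \beta_\lambda L(\psi_\lambda) \right)^2 = \sup_{t \in \mathbb{B}_\Lambda} (L(t))^2.$$

□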



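Choosing respectively $L(t) = t(x)$, $L(t) = t(x) - Pt$ and $L(t) = (P_N - P)t$, (29) readily implies the following corollary, which can be found essentially in [Mas07].

Corollary 6. Assume that for every $x \in \mathcal{X}$, $\sum_{\lambda \in \Lambda} (\psi_\lambda(x))^2 < \infty$. Then $\sum_{\lambda \in \Lambda} (\psi_\lambda(x) - P\psi_\lambda)^2 < +\infty$ and, $\forall x \in \mathcal{X}$,

$$(31) \qquad \sum_{\lambda \in \Lambda} (\psi_\lambda(x) - P\psi_\lambda)^2 = \sup_{t \in \mathbb{B}_\Lambda} \left\{ (t(x) - Pt)^2 \right\},$$

$$(32) \qquad \sum_{\lambda \in \Lambda} (\psi_\lambda(x))^2 = \sup_{t \in \mathbb{B}_\Lambda} \left\{ t(x)^2 \right\}.$$

Moreover, $\forall \xi_1, \ldots, \xi_N \in \mathcal{X}$,

$$(33) \qquad \sup_{t \in \mathbb{B}_\Lambda} \left\{ ((P_N - P)t)^2 \right\} = \sum_{\lambda \in \Lambda} ((P_N - P)\psi_\lambda)^2.$$

Proof of Corollary 6. We only have to check that $\sum_{\lambda \in \Lambda} (\psi_\lambda(x) - P(\psi_\lambda))^2 < \infty$. This follows from the fact that $\sum_{\lambda \in \Lambda} (\psi_\lambda(x))^2 < +\infty$ by assumption, and that $\sum_{\lambda \in \Lambda} (P\psi_\lambda)^2 < \infty$ since it is equal to the squared $L^2$-norm of $\Pi_\Lambda s$. □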

Exact formula for the ideal penalty. Note that

$$(34) \qquad \mathrm{pen}_{\mathrm{id}}(m) = 2(P_n - P)(\hat{s}_m - s_m) + 2(P_n - P)(s_m) = 2\|\hat{s}_m - s_m\|^2 + 2(P_n - P)(s_m),$$

using Eq. (28), so the following lemma provides an exact formula for the ideal penalty.

Lemma 7 ([Ler12b]). For all $m \in \mathcal{M}_n$ and $A \subset \{1, \ldots, n\}$,

2 f / ^2^ ^ . 1 



sup f((Pf-prfl 
ieB m I v 7 J 



and 



3 m 



n 



Proof of Lemma 7. The first equality comes from Eq. (28) and (33) (Corollary 6). The second equality is proved in [Ler12b, Lemma 6.11]. □

Control of some remainder terms. We will use repeatedly in the proofs that under assumptions (H1) and (H4),

b 2 m < Lj D m + \\s m \\ 2 ) <lJ 1 + hA. \ Rm , 



v m < b m \\s\\ < 



N 



'R 



which can be rewritten as follows by defining C± := ( ( 1 + •U— =r- ) ) ( I v 



(35) 



&m < C$Rm and t& < Cl\/Rrn < 



m — ^* V ±u m 



'R* 



-Rr, 



Appendix D. Proof of Proposition 3

A key result in the proof of Proposition 3 is the following lemma about the "covariance" of the weights $W_i$.

Lemma 8. Assume that $V \ge 2$ and $n \ge 4$ and that (H5*) holds. For all $i \in \{1, \ldots, n\}$, let $K_i$ be the index of the block such that $i \in B_{K_i}$. For all $i, j \in \{1, \ldots, n\}$, we have

$$(36) \qquad E_{i,j}^{(\mathrm{VF})} := \mathbb{E}\left[ (W_i - 1)(W_j - 1) \right] = \frac{1}{V-1} - \frac{V}{(V-1)^2} \mathbf{1}_{K_i \ne K_j}.$$

Lemma 8 is proved in Section I.4.1.
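
The covariance in Lemma 8 can be checked by direct enumeration. The sketch below assumes the V-fold weights take the form $W_i = \frac{V}{V-1} \mathbf{1}_{i \notin B_J}$ with $J$ uniform on the V blocks (an assumption on our side, since the weights are defined earlier in the paper), and compares the exact expectation with the values $1/(V-1)$ for two indices in the same block and $-1/(V-1)^2$ for two indices in different blocks.

```python
from fractions import Fraction


def vfold_weight_covariance(V, same_block):
    """Exact E[(W_i - 1)(W_j - 1)] by enumerating the left-out block J."""
    block_i = 0
    block_j = 0 if same_block else 1

    def centered_weight(block, left_out):
        # assumed weights: W = 0 on the left-out block, V / (V - 1) elsewhere
        if block == left_out:
            return Fraction(-1)
        return Fraction(V, V - 1) - 1

    total = sum(centered_weight(block_i, J) * centered_weight(block_j, J)
                for J in range(V))
    return total / V


for V in (2, 5, 10):
    print(V, vfold_weight_covariance(V, True), vfold_weight_covariance(V, False))
```

Both cases are covered by the single expression $\frac{1}{V-1} - \frac{V}{(V-1)^2} \mathbf{1}_{K_i \ne K_j}$.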
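We can now prove Proposition 3. Let us compute $\mathrm{pen}_{\mathrm{VF}}(m)$ on an orthonormal basis $(\psi_\lambda)_{\lambda \in \Lambda_m}$ of $S_m$. We have, from Eq. (11),

$$\mathrm{pen}_{\mathrm{VF}}(m) = 2C \, \mathbb{E}_W\left[ (P_n^W - P_n)(\hat{s}_m^W - \hat{s}_m) \right] = 2C \, \mathbb{E}_W\left[ \sum_{\lambda \in \Lambda_m} \left( (P_n^W - P_n)\psi_\lambda \right)^2 \right] = 2C \, \mathbb{E}_W\left[ \sum_{\lambda \in \Lambda_m} \left( (P_n^W - P_n)(\psi_\lambda - P\psi_\lambda) \right)^2 \right]$$

$$(37) \qquad = \frac{2C}{n^2} \sum_{\lambda \in \Lambda_m} \sum_{1 \le i, j \le n} \mathbb{E}\left[ (W_i - 1)(W_j - 1) \right] (\psi_\lambda(X_i) - P\psi_\lambda)(\psi_\lambda(X_j) - P\psi_\lambda).$$

Thanks to Lemma 8, we deduce that

$$\mathrm{pen}_{\mathrm{VF}}(m) = \frac{2C}{n^2 (V-1)} \sum_{\lambda \in \Lambda_m} \sum_{1 \le i, j \le n} (\psi_\lambda(X_i) - P\psi_\lambda)(\psi_\lambda(X_j) - P\psi_\lambda) - \frac{2CV}{(V-1)^2 n^2} \sum_{\lambda \in \Lambda_m} \sum_{1 \le K \ne K' \le V} \sum_{i \in B_K, \, j \in B_{K'}} (\psi_\lambda(X_i) - P\psi_\lambda)(\psi_\lambda(X_j) - P\psi_\lambda)$$

$$= \frac{2C}{V-1} \left( \|\hat{s}_m - s_m\|^2 - \frac{V}{V-1} \left( U_m - U_{\mathcal{B},m} \right) \right) \quad \text{by Eq. (28)}.$$

□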



Appendix E. Proof of Proposition 4

From Proposition 3, it is sufficient to prove concentration inequalities for $U_{\mathcal{B},m}$, $U_m$ and $\|\hat{s}_m - s_m\|^2$, which is respectively done by Lemmas 10 and 11 below.

In the proof, we use in particular that R* < R m < n by (|H4p . at least for n large enough. 
So, when applying Lemma [11] with a = n, we have min {R*,n} = R*. Furthermore, we get 
R*/(xn) < 1/x < 1/2 since we can assume x > 2 (otherwise, the probability bounds are greater 
than 1), so this term that appears in £5 when applying Lemma [TUl can be merged with the constant 
term in e% (m, V, x) . □ 

First, we prove a general concentration result for U-statistics of the form of Ug m that is valid 
with no assumption on the partition B. 

Lemma 9. Let S m be a linear space of function and let {ip\)\£A m be an orthonormal basis of 
S m . Let {BkSk = \ be o- partition of {1, ...,n} and for all K = 1,...,V, let ur = Card(i3ft-). Let 
Z = n 2 Uis,m- Then, an absolute constant C > exists such that, for all x > 0, for all rj £ (0, l/y/x\, 
we have 



P\ \Z\ < C r,D, 



In particular, if TZ m > and v > satisfy 
(38) 

taking ij = ex in the previous inequality yields, for all x such that e\fx < 1, 

,2\ 2 



E 



Vb 2 x A 



if 



> 1 



Dm — T^mi V m ^ V T^mi b m < n m , 



P \U B ,m\ < Cx 



n r . 



e + 



n 



v 



\ K=l 




> 1 



„2-z 



Proof of Lemma\Q For all K = 1, V, let 

z K = E (MXi)-P*l>x)(MXj)-Pil>\)- 

As the random variables Zk are independent, we can apply [BBLM05, Theorem 2] (see Lemma 
to get that 



|Z|I„< 



V 



q \\ z k\ 



2 

q/2- 



K=l 



From [Lerlll . Corollary 4.3 in the supplementary material] (recalled in Corollary [29]) , we have, with 
probability larger than 1 — 4e~ x , 



\Z K \ < C (en K D m + ^fv 2 m x+ 



Integrating over x (see Lemma [31] for detailed computations) , we deduce an absolute constant 
C > exists such that 

\\Z K \\ q/2 < C (en K D m + ^v 2 m q + ^ 
Hence, there exists an absolute constant C such that 



\Z\\ q <C\eD m 



V 



? ^3/2 



K=\ 



\ K=l 



Vbla 5 / 2 

Using [Arl07, Lemma 8.10] (recalled in Lemma 30), we obtain that there exists an absolute constant $C$ such that, with probability $1 - e^{2-x}$,



\Z\<C\ eD n 



K=l 



V E < + 

\ K=l 



Vblx 5 / 2 



We conclude the proof, taking e = rj/y/x. 

In particular, for regular partitions $\mathcal{B}$, Lemma 9 implies the following.



□ 



Lemma 10. Let $\xi_{1:n}$ be i.i.d. random variables and let $S_m$ be a linear space satisfying Assumptions (H1), (H4). For all $m \in \mathcal{M}_n$, let $(\psi_\lambda)_{\lambda \in \Lambda_m}$ be an orthonormal basis of $S_m$. Let $(B_K)_{K=1,\ldots,V}$ be a partition of $\{1, \ldots, n\}$ satisfying (H5*). Let $C_\star$ be the constant defined in Eq. (35). Let $R_\star$ be such that $0 < R_\star \le R_m$ and

e 3 (n,V,x) 



yi/3 



v 



v 



.rn 



Then, an absolute constant L > exists such that, for all 2 < x < 2 R 1 J 2 A y 1 / 6 -?^ 4 , 

ax 1 / 2 /l, ' 



I^B.ml > Le 3 (n,V,x)- 



,1/4 



< e 



2-z 



Proof of Lemma [TU. From (|H5*j) . we have nx = n/V, hence 

V 



£ 

K=l 



n 
V 



thus 



1 

n 



V 



\ K=l 



11 



K 



Condition of Lemma M holds with v = \fxR± ^ 4 , 7£ m = C^i? m from 
this lemma that there exists an absolute constant L such that, for all e > satisfying ey/x < 1, 



We deduce from 



P ( \U B>m \ < C+Lx^ 
\ n 



v 2 

e+ — 

e 



1 yV 



> 1 



„2-x 



Let e = V l l^v Ax 1 / 2 , we have 

v 2 e v 

V < 



V 1 
< — jV 



1/1 



-v \ < 



1 i2* 



V 



xR 



1/4 



□ 



Lemma 11. Let $\xi_{1:n}$ be i.i.d. random variables and let $S_m$ be a linear space satisfying Assumptions (H1), (H4). Let $(\psi_\lambda)_{\lambda \in \Lambda_m}$ be an orthonormal system in $S_m$ and let $R_\star$ be such that $0 < R_\star \le R_m$. Let $C_\star$ be the constant defined in (35). There exists an absolute constant $L$ such that,



V2 < x < 



s/R* 
C 2 ' 



\U m \ > L 



C±y/x R n 

~rW~ 



and for every A C { 1, . . . , n} with cardinality a > 0, 



V2 < x < 



\/R± A a 
C 2 : 



^ -s 



Drn 

a 



> L 



I < 4e"^ , 

C*y/x Rn, 

{R^Aay/ 4 ~a 



<2e~ x . 



Proof of Lemma \11[ We combine Lemma [7] and Eq. (]35[) with two results from [Lerlll . supple- 
mentary material]: Theorem 4.1 (recalled in Proposition [28]) and Corollary 4.3 (recalled in Corol- 
lary [29]). □ 
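Appendix F. Proof of Lemma 1

Let us first recall here the proof of Eq. (5) (coming from [Arl08]) for the sake of completeness.

By (H5*), $P_n - P_n^{(-B_K)} = V^{-1}\bigl(P_n^{(B_K)} - P_n^{(-B_K)}\bigr)$ and $P_n^{(B_K)} - P_n = (V-1)V^{-1}\bigl(P_n^{(B_K)} - P_n^{(-B_K)}\bigr)$,
so that

\begin{align*}
\mathrm{crit}_{\mathrm{penVF}}(m, \mathcal{B}, V-1) &:= P_n\gamma(\widehat{s}_m) + \mathrm{pen}_{\mathrm{VF}}(m, \mathcal{B}, V-1) \\
&= P_n\gamma(\widehat{s}_m) + \frac{V-1}{V} \sum_{K=1}^{V} \bigl(P_n - P_n^{(-B_K)}\bigr)\gamma\bigl(\widehat{s}_m^{(-B_K)}\bigr) \\
&= P_n\gamma(\widehat{s}_m) + \frac{1}{V} \sum_{K=1}^{V} \bigl(P_n^{(B_K)} - P_n\bigr)\gamma\bigl(\widehat{s}_m^{(-B_K)}\bigr) \\
&= \mathrm{crit}_{\mathrm{corr,VFCV}}(m, \mathcal{B}) .
\end{align*}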



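We can prove Eq. (6) and (7) simultaneously, by proving a slightly more general result, namely
Eq. (42) below. Let $\mathcal{E}$ be a set of subsets of $\{1, \ldots, n\}$ such that

\[ \forall A \in \mathcal{E}, \quad \mathrm{Card}(A) = p \qquad\text{and}\qquad \frac{1}{\mathrm{Card}(\mathcal{E})} \sum_{A \in \mathcal{E}} P_n^{(A)} = P_n . \tag{39} \]

Let us consider the associated penalty

\[ \mathrm{pen}_{\mathcal{E}}(m, C) := \frac{C}{\mathrm{Card}(\mathcal{E})} \sum_{A \in \mathcal{E}} \bigl(P_n - P_n^{(-A)}\bigr)\gamma\bigl(\widehat{s}_m^{(-A)}\bigr) = \frac{2C}{\mathrm{Card}(\mathcal{E})} \sum_{A \in \mathcal{E}} \bigl(P_n^{(-A)} - P_n\bigr)\bigl(\widehat{s}_m^{(-A)}\bigr) \]

and the associated cross-validation criterion

\[ \mathrm{crit}_{\mathcal{E}}(m) := \frac{1}{\mathrm{Card}(\mathcal{E})} \sum_{A \in \mathcal{E}} P_n^{(A)}\gamma\bigl(\widehat{s}_m^{(-A)}\bigr) . \]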

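When $\mathcal{E} = \{B_1, \ldots, B_V\}$, we get the $V$-fold penalty $\mathrm{pen}_{\mathrm{VF}} = \mathrm{pen}_{\mathcal{E}}$ and the $V$-fold cross-validation
criterion $\mathrm{crit}_{\mathrm{VFCV}} = \mathrm{crit}_{\mathcal{E}}$, and Eq. (39) holds true with $p = n/V$ under assumption (H5*). When
$\mathcal{E} = \mathcal{E}_p := \{A \subset \{1, \ldots, n\} \text{ s.t. } \mathrm{Card}(A) = p\}$, Eq. (39) always holds true and we get the
leave-$p$-out penalty $\mathrm{pen}_{\mathrm{LPO}} = \mathrm{pen}_{\mathcal{E}_p}$ and the leave-$p$-out cross-validation criterion $\mathrm{crit}_{\mathrm{LPO}} = \mathrm{crit}_{\mathcal{E}_p}$.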

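Let $(\psi_\lambda)_{\lambda \in \Lambda_m}$ be some orthonormal basis of $S_m$ in $L^2(\mu)$. On the one hand, using Eq. (39) and (26),
we get

\begin{align*}
\mathrm{pen}_{\mathcal{E}}(m, C) &= \frac{2C}{\mathrm{Card}(\mathcal{E})} \sum_{A \in \mathcal{E}} \sum_{\lambda \in \Lambda_m} \Bigl[ \bigl(P_n^{(-A)}(\psi_\lambda) - P_n(\psi_\lambda)\bigr) P_n^{(-A)}(\psi_\lambda) \Bigr] \\
&= \frac{2C}{\mathrm{Card}(\mathcal{E})} \sum_{\lambda \in \Lambda_m} \Bigl[ \sum_{A \in \mathcal{E}} \bigl(P_n^{(-A)}(\psi_\lambda)\bigr)^2 - P_n(\psi_\lambda) \sum_{A \in \mathcal{E}} P_n^{(-A)}(\psi_\lambda) \Bigr] \\
&= \frac{2C}{\mathrm{Card}(\mathcal{E})} \sum_{\lambda \in \Lambda_m} \sum_{A \in \mathcal{E}} \Bigl[ \bigl(P_n^{(-A)}(\psi_\lambda)\bigr)^2 - \bigl(P_n(\psi_\lambda)\bigr)^2 \Bigr] . \tag{40}
\end{align*}

On the other hand, using that $P_n^{(A)} = \frac{n}{p} P_n - \frac{n-p}{p} P_n^{(-A)}$ by (39),

\begin{align*}
\mathrm{crit}_{\mathcal{E}}(m) - P_n\gamma(\widehat{s}_m) &= \frac{1}{\mathrm{Card}(\mathcal{E})} \sum_{A \in \mathcal{E}} \Bigl[ P_n^{(A)}\gamma\bigl(\widehat{s}_m^{(-A)}\bigr) - P_n\gamma(\widehat{s}_m) \Bigr] \\
&= \frac{1}{\mathrm{Card}(\mathcal{E})} \sum_{\lambda \in \Lambda_m} \sum_{A \in \mathcal{E}} \Bigl[ \bigl(P_n^{(-A)}(\psi_\lambda)\bigr)^2 - 2 P_n^{(A)}(\psi_\lambda) P_n^{(-A)}(\psi_\lambda) + \bigl(P_n(\psi_\lambda)\bigr)^2 \Bigr] \\
&= \Bigl(\frac{2n}{p} - 1\Bigr) \frac{1}{\mathrm{Card}(\mathcal{E})} \sum_{\lambda \in \Lambda_m} \sum_{A \in \mathcal{E}} \Bigl[ \bigl(P_n^{(-A)}(\psi_\lambda)\bigr)^2 - \bigl(P_n(\psi_\lambda)\bigr)^2 \Bigr] , \tag{41}
\end{align*}

where we used again Eq. (26) and (39). Comparing Eq. (40) and (41) gives

\[ \mathrm{crit}_{\mathcal{E}}(m) = P_n\gamma(\widehat{s}_m) + \mathrm{pen}_{\mathcal{E}}\Bigl(m, \frac{n}{p} - \frac{1}{2}\Bigr) , \tag{42} \]

which implies Eq. (6) and (7). Eq. (8) follows by [Ler12b].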
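Identity (42) can be checked numerically in the V-fold case, where it reads crit_VFCV(m) = P_n γ(ŝ_m) + pen_VF(m, B, V - 1/2). The following is only an illustrative sketch under assumptions that are not part of the paper: data on [0, 1), a regular histogram model with d bins (basis ψ_λ = √d 1_{I_λ}), a regular partition made of consecutive blocks, and hypothetical helper names `bin_means` and `vfold_identity`.

```python
import random

def bin_means(points, d):
    """Coefficients P^{(S)}(psi_lambda) of the histogram estimator on [0, 1) with d bins."""
    counts = [0] * d
    for x in points:
        counts[min(int(x * d), d - 1)] += 1
    return [(d ** 0.5) * c / len(points) for c in counts]

def vfold_identity(sample, V, d):
    """Return (crit_VFCV, P_n gamma(s_hat) + pen_VF with C = V - 1/2)."""
    n = len(sample)
    size = n // V  # regular partition: every block has n / V points (n must be a multiple of V)
    blocks = [sample[k * size:(k + 1) * size] for k in range(V)]
    c = bin_means(sample, d)             # P_n(psi_lambda)
    emp_risk = -sum(v * v for v in c)    # P_n gamma(s_hat_m) = -sum_lambda (P_n psi_lambda)^2
    crit_cv, pen_unit = 0.0, 0.0
    for k in range(V):
        train = [x for j, blk in enumerate(blocks) if j != k for x in blk]
        a = bin_means(train, d)          # P_n^{(-B_K)}(psi_lambda)
        b = bin_means(blocks[k], d)      # P_n^{(B_K)}(psi_lambda)
        # P_n^{(B_K)} gamma(s_hat^{(-B_K)}) = ||s_hat^{(-B_K)}||^2 - 2 P_n^{(B_K)}(s_hat^{(-B_K)})
        crit_cv += sum(ai * ai - 2 * ai * bi for ai, bi in zip(a, b)) / V
        # V-fold penalty with C = 1: (1/V) sum_K 2 (P_n^{(-B_K)} - P_n)(s_hat^{(-B_K)})
        pen_unit += sum(2 * ai * (ai - ci) for ai, ci in zip(a, c)) / V
    return crit_cv, emp_risk + (V - 0.5) * pen_unit

random.seed(0)
sample = [random.random() for _ in range(60)]
lhs, rhs = vfold_identity(sample, V=5, d=4)
print(abs(lhs - rhs) < 1e-9)
```

The two returned values agree up to floating-point error because the blocks have equal size; with unequal blocks, condition (39) fails and the identity is not expected to hold exactly.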



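Note that Eq. (8) can also be deduced from Proposition 2.1 in [Cel12], which proves that

\[ \mathrm{crit}_{\mathrm{LPO}}(m, p) = \frac{1}{n(n-p)} \sum_{\lambda \in \Lambda_m} \Bigl( \sum_{i=1}^{n} \psi_\lambda(\xi_i)^2 - \frac{n-p+1}{n-1} \sum_{i \neq j} \psi_\lambda(\xi_i)\psi_\lambda(\xi_j) \Bigr) . \]

Elementary algebraic computations show then that

\[ \mathrm{crit}_{\mathrm{LPO}}(m, p) - P_n\gamma(\widehat{s}_m) = \frac{2n-p}{n^2(n-p)} \sum_{\lambda \in \Lambda_m} \Bigl( \sum_{i=1}^{n} \psi_\lambda(\xi_i)^2 - \frac{1}{n-1} \sum_{i \neq j} \psi_\lambda(\xi_i)\psi_\lambda(\xi_j) \Bigr) . \]

From Lemma 1 and the latter equation, we obtain that, for any $p, p'$,

\[ \frac{n/p - 1}{n/p - 1/2} \bigl( \mathrm{crit}_{\mathrm{LPO}}(m, p) - P_n\gamma(\widehat{s}_m) \bigr) = \frac{n/p' - 1}{n/p' - 1/2} \bigl( \mathrm{crit}_{\mathrm{LPO}}(m, p') - P_n\gamma(\widehat{s}_m) \bigr) . \]

In particular, when $p' = 1$, from (7), since $\mathrm{pen}_{\mathrm{LPO}}(m, 1, C) = \mathrm{pen}_{\mathrm{LOO}}(m, C)$,

\begin{align*}
\mathrm{pen}_{\mathrm{LPO}}\Bigl(m, p, \frac{n}{p} - \frac{1}{2}\Bigr) &= \frac{n/p - 1/2}{n/p - 1} \, \frac{n-1}{n - 1/2} \, \mathrm{pen}_{\mathrm{LPO}}\Bigl(m, 1, n - \frac{1}{2}\Bigr) \\
&= \mathrm{pen}_{\mathrm{LOO}}\Bigl(m, (n-1) \frac{n/p - 1/2}{n/p - 1}\Bigr) .
\end{align*}

□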
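For the reader's convenience, the "elementary algebraic computations" relating the leave-$p$-out criterion to the empirical risk can be sketched as follows. This is a worked check rather than the authors' own derivation; it assumes the least-squares contrast $\gamma(t) = \|t\|^2 - 2t$, so that $P_n\gamma(\widehat{s}_m) = -\sum_{\lambda}(P_n\psi_\lambda)^2$, and uses the shorthand $S_1(\lambda)$, $S_2(\lambda)$ introduced only for this computation.

```latex
\begin{align*}
S_1(\lambda) &:= \sum_{i=1}^{n} \psi_\lambda(\xi_i)^2 , \qquad
S_2(\lambda) := \sum_{i \neq j} \psi_\lambda(\xi_i)\psi_\lambda(\xi_j) , \qquad
P_n\gamma(\widehat{s}_m) = -\frac{1}{n^2} \sum_{\lambda \in \Lambda_m} \bigl( S_1(\lambda) + S_2(\lambda) \bigr) , \\
\text{coefficient of } S_1 &: \quad \frac{1}{n(n-p)} + \frac{1}{n^2} = \frac{2n-p}{n^2(n-p)} , \\
\text{coefficient of } S_2 &: \quad \frac{1}{n^2} - \frac{n-p+1}{n(n-p)(n-1)}
  = \frac{(n-p)(n-1) - n(n-p+1)}{n^2(n-p)(n-1)}
  = -\frac{2n-p}{n^2(n-p)(n-1)} , \\
\frac{2n-p}{n^2(n-p)} &= \frac{2}{n^2} \cdot \frac{n/p - 1/2}{n/p - 1} .
\end{align*}
```

The last line shows that multiplying the difference of the criterion and the empirical risk by $(n/p - 1)/(n/p - 1/2)$ removes all dependence on $p$, which is exactly the proportionality relation between leave-$p$-out and leave-$p'$-out penalties.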



Appendix G. Proof of Theorem 1

The proof of Theorem 1 can be sketched as follows. First, we prove a general oracle inequality
valid for any penalty approximately equal to $C\|\widehat{s}_m - s_m\|^2$ for some constant $C > 1$ (Lemma 12).
Then, we show that the $V$-fold penalty satisfies this condition, which is mostly a consequence of
Proposition SI Finally, we check the assumptions of Theorem [1] imply those of Lemma [T2l 
G.1. A general model selection theorem.

Lemma 12. Assume (jHip and ()H4p hold true, and that 

By > 2, $ > , Bit G Mi(M n ) such that Vm,m'eM n , if x m := - In (7r(m) ) 



/D* \ ... a, III,™ — A^r 

y lx "n V 

Lei pen : .M n — > [0, +oo) 6e some penalty function and 

m G arg min I \\s m \\ 2 - 2P n (s m ) + pen(m) \ . 
meMn I. J 

Assume moreover, for y, {x m ) m< ^M n an d 71 suc h that QH2g[ ) ZioMs true, i/iat absolute constants 
[Li)i = i t 2, a sequence {z n ) n ^ and a constant c > 1/2 exist such that: for all m G M. n , a constant 
c m and a function £i m exist such that 



(CI) y + x m < z n , — -j= > and z n > +oo , 



(C2) 



Vx < c m , P | c 1 pen(m) - 2 ||s m - s m || 2 < L^i^x)^- X > 1 - L 2 e 



-a; 



Let v n := \f2§ {R* n ) 1/4 ^/z^ and e 2 ,m{x) 

CE\ m {x -\- £ m ) ~l~ z^n . Then, some hq > and absolute 
constants (-Lj)j=3,4 exist such that for all n > uq and for all x < y A inf mg _A/( (c m — x m ), with 
probability larger than 1 — L^e~ x , for all m 6 .M n , 

/ in\ 1-2( C - 1)_ -L 3 £ 2>m (x) ||2^||- ||2 

l + 2(c- 1)+ +L 3 e 2 ,m(2;) 

JTie constant uq depends on all parameters of assumptions (|H2g]) ; ([Cip . (|C2j) . but it does not 
depend on m. 

Remark 2. Lemma 12 is an extension of Theorem 3.2 in [Ler12b], which corresponds to the
particular case $c = 1$. We prove Lemma 12 here to study more general $V$-fold criteria, since some
interesting ones are biased, as explained in Lemma 1. Lemma 12 will be used to prove both the
results on $V$-fold procedures and those on hold-out penalties (see Sections 7.1 and I.1); this is why
it is stated as a separate result.

Before proving Lemma 12, let us give a concentration inequality that we will need in the proof.

Lemma 13 (Corollary of Lemma 6.8 in Lerl2b| ). Assume ( |H2g[ ) holds true for some ir, y, <£, and 
recall that x m = — ln(-7r(m)). Then, an absolute constant L > exists such that, for all x < y, with 
probability larger than 1 — e~ x , 

(44) 3m,m G M n , {P n - P){s m - s m >) < LV$- — 77-7777 + 

Lemma 13 is proved in Section I.4.2.
Proof of Lemma 12. By definition of $\widehat{m}$ and Eq. (34), for all $m \in \mathcal{M}_n$,

(45) \\s - s m \\ 2 < \\s m - s\\ 2 + pen(m) - 2 \\s m - s m || 2 + 2 ||%j - s m \\ 2 - pen(m) 

2(-fn i 3 )(s m Sffi) . 

The idea of the proof is to use that $\mathrm{pen}(m) \approx \mathrm{pen}_{\mathrm{id}}(m)$ up to multiplying factors (condition (C2))
and a centered term (that is concentrated by Lemma 13). Then, we will show that the remainder
terms are negligible in front of the risk, in particular by using the fact that $\|\widehat{s}_m - s_m\|^2$ is concentrated
around its expectation.
Construction of a favorable event $\Omega(\mathcal{M}_n)$. Let $2 \le x \le y \wedge \inf_{m \in \mathcal{M}_n}(c_m - x_m)$. Recall that $C_\star$ is
defined above Eq. ([35]) . Let us define, for all m G -M ni Cm( x ) = C* ( R n \Ar + x m and for some 
constant L5 > to be chosen later 



Vm G 7W r 



3 m 



D m 

n 



< L 5 ( m (x) 



R„ 



tt 2 (M n ) := ^i(m,m')£M 2 n ,2{P n -P){s m -s m ,)<L 5 v n (^ + ^Lj\ , 
:= I Vm G M n , c" 1 pen(m) - 2 ||s m - s m || 2 < L 5 £i im (x m + x)^ 1 j . 

From ()Cip . some constant n\ > exists such that, for all n>n\,z n < C^ 2 ^/R n . As x m > and 
R n < n from (|H4p . for any 2 < x < y, we obtain 2 < x + x m < a/ i?* A n/ C 2 , hence, Lemma [TT1 
applies with i?* = R n , and, using a union bound, we obtain an absolute constant Lg such that, for 
all n > ni, if L$ > Lq, 

P{fti(M„) c } < 2 ^ e - * - *™ = 2e- x . 

m&Mn 

From ( |H2g[ ) and the fact that x < y, Lemma [13] applies. As x m + x m > + x < 2z n by (jCl|h there 
exists ri2 and an absolute constant L7 such that, for all n > n 2 , if £5 > £7, 

P{ft 2 (-Mn) c } <e"" . 
From condition (|C2p and since x + x m < c m , for all m G A^ n and all L5 > Li 

P{^ 3 (A^n) c } < L 2 e~ x < m ) = L ^~ X ■ 



m£Mr 



Hence, choosing L5 = max{ L\, Lq, Lj}, L4 = L 2 + 3, the event £l(M. n ) := Qi(Ai n ) n Q 2 (A4 n ) n 
^3(-A^ n ) satisfies P{S7(.M n ) c } < Z^e -21 if n > max{ni,n2}. 



Eq. (j43l) holds on f2(.M n ). From (jCip . there exists rc.3 such that, for all n > n 3 , L^C* (R n \ 
1/2, so on Qi(A4 n ), for all n > n 3 , we have 

Vm G M n , ||s"r 

Therefore, for L 3 = 2L5, on Q(A4 n ), for all m G Ai n and n > no := max{ni, n 2 , n^}, 



-1/4 



Zn < 



2 . 1 R rn 

s\\ > 

11 ~ 2 n 



pen(m) - 2 ||s m - s m || 2 < 2(c - 1)+ \\s m - s m \\ 2 + L 3 cei jm (x m + x) \\s m - s\\ 2 
pen(m) - 2 ||s^ - ?™|| 2 ) < 2(c- 1)_ ||s^ - 3™|| 2 + L 3 cei ) ^(aJa + x) - s|| 2 

2(-P n - P){s m - Sfh) < L 3 U n \\s - Sfh\\ 2 + L 3 Z/ n ||s - S m || 2 . 



Plugging these inequalities into Eq. ([45]) yields Eq. (03]). 



□ 



G.2. Proof of Theorem 1. We will use in the proof the following lemma, showing the assumptions
of Theorem 1 imply assumptions (H2g) and (C1) of Lemma 12.

Lemma 14. (fH2~|) . (]H2'|) or (|H2^|) together with (jHlj) . (|H3|) . ([Hi]) zmp/y ( |H2gl ) and (fCT|) . wi^ 

7r i/ie uniform probability measure on A4 n , y = 21nn + ln(LJ 1 ) and $ depending on the parameters 
appearing in the assumptions that hold true. 

Remark 3. Lemma 14 is the only part of the proof of Theorem 1 where we use one assumption
among (H2), (H2') or (H2'').



Proof of Lemma JJ. (|H3P ensures that x m = O(lnn). Under (|H2'|) . we have 



v m.m> < SU P / t2sd V < \\s\ 



and 



As 



'- > \fM, > \/c R (lnn) 2+r / 2 , (ICTl) and ( |H2g| ) hold for some constant $> (Hs^ , T, c~) 



Under (|H2°p . using successively the inequality P((t — Pt) 2 ) < P(t 2 ), Cauchy-Schwarz inequality, 
the triangular inequality and (|Hip . we have 



u m,m' — 



sup P 



< sup 



^2 ax ^ x 

AeA m uA m ; 



^2 ax ^> 

AeA m uA ra / 



^ a\ip> 



AeA m uA m / 



< ( Om + b r , 



\s\\ < 2 llsl 



Rm V i? m ' 



b 2 m , m/ <2(b 2 m + b 2 m ,)<j-(R m VR m ,) 



These inequalities yield also ( |H2g[ ) and (|Cip , Under Assumption ()H2|) . we can choose a basis of 
L 2 (fi) such that (|H2^) holds, hence (lH2~|) ensures also JH2g| ). □ 



Proof of TheoremU\ We apply Lemma [T2l PropositionE]ensures that Condition (1C2D in Lemma[T2l 
is fulfilled with Li for some absolute constant, L2 = e 2 + 4, 



c = k, Vm G .M , c„ 



Ay 1/6 (^) 1/4 , £l , m (x) = e 2 (n,U,x) . 



From (|H4p . inf me x n c m > Lln(n) 1+r / 4 . Let z n = (3 + a>i)lnn, hence, for all n large enough, 
x m = ln(Card(.A/f n )), x = 21nn + ln(LJ 1 ), we have x + x m < z n , x + x m < c m for all m £ A4 n , and 
z n /^/R^ — > 0. Since ()Cip holds for these values by Lemma [HI we deduce from Lemma [T2l that 
Theorem [1] holds for n sufficiently large. Taking L such that Lns(n, V) > 1 for too small values of 
n yields the result. □ 



Appendix H. Proof of Theorem [2] 



Theorem 2 actually is a corollary of the following proposition, since $C_D = C_C = 0$ when $C = V - 1$,
and $V$-fold cross-validation corresponds to $C = V - 1/2$ (see Lemma 1).



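Proposition 15. Let $(\psi_\lambda)_{\lambda \in \Lambda_m}$ and $(\psi_\lambda)_{\lambda \in \Lambda_{m'}}$ be two finite orthonormal families of vectors of
$L^2(\mu)$, $C > 0$ some constant, and define

\begin{align*}
C_B &= \frac{8}{n^3} \Bigl[ \frac{C^2(n-V+1)}{(V-1)^3} + \frac{2C}{V-1} + n - 1 \Bigr] , \\
C_C &= \frac{4(C-V+1)}{n^2(V-1)} , \\
C_D &= \frac{4(C-V+1)^2}{(V-1)^2 n^3} .
\end{align*}

Assume that $\mathcal{B}$ satisfies (H5*) and $\mathrm{pen}_{\mathrm{VF}}(m) = \mathrm{pen}_{\mathrm{VF}}(m; \mathcal{B}; C)$. Then, with the notation of
Theorem 2 and noting $\beta(\Lambda) = \beta(\Lambda, \Lambda)$ and similarly with $\gamma(\cdot)$, $\zeta(\cdot)$, for every $m \in \mathcal{M}_n$,

\begin{align*}
\mathrm{var}\bigl(\mathrm{pen}_{\mathrm{id}}(m)\bigr) &= \frac{4}{n} \mathrm{var}_P(s_m) + \frac{8(1 - n^{-1})}{n^2} \beta(\Lambda_m) + \frac{4}{n^2} \gamma(\Lambda_m) + \frac{4}{n^3} \zeta(\Lambda_m) , \tag{46} \\
\mathrm{var}\bigl(2\|\widehat{s}_m - s_m\|^2\bigr) &= \frac{8(n-1)}{n^3} \beta(\Lambda_m) + \frac{4}{n^3} \zeta(\Lambda_m) , \tag{47} \\
\mathrm{var}\bigl(\mathrm{pen}_{\mathrm{VF}}(m)\bigr) &= \frac{8C^2(n-V+1)}{n^3(V-1)^3} \beta(\Lambda_m) + \frac{4C^2}{n^3(V-1)^2} \zeta(\Lambda_m) , \tag{48} \\
\mathrm{var}\bigl(\mathrm{pen}_{\mathrm{VF}}(m) - \mathrm{pen}_{\mathrm{id}}(m)\bigr) &= \frac{4}{n} \mathrm{var}_P(s_m) + C_B \beta(\Lambda_m) - C_C \gamma(\Lambda_m) + C_D \zeta(\Lambda_m) , \tag{49}
\end{align*}

for every $m, m' \in \mathcal{M}_n$,

\begin{align*}
&\mathrm{var}\bigl( (\mathrm{pen}_{\mathrm{VF}}(m) - \mathrm{pen}_{\mathrm{id}}(m)) - (\mathrm{pen}_{\mathrm{VF}}(m') - \mathrm{pen}_{\mathrm{id}}(m')) \bigr) \\
&\qquad = \frac{4}{n} \mathrm{var}_P(s_m - s_{m'}) + C_B B(\Lambda_m, \Lambda_{m'}) - C_C C(\Lambda_m, \Lambda_{m'}) + C_D D(\Lambda_m, \Lambda_{m'}) ,
\end{align*}

and for every $C > 0$ and $m \in \mathcal{M}_n$,

\begin{align*}
\mathrm{var}\bigl(P_n\gamma(\widehat{s}_m) + \mathrm{pen}_{\mathrm{VF}}(m; \mathcal{B}; C)\bigr)
&= \frac{4}{n} \mathrm{var}_P(s_m) + \frac{2}{n^2} \Bigl[ 1 + \frac{4C^2}{(V-1)^3} - \frac{1}{n} \Bigl( \frac{2C}{V-1} - 1 \Bigr)^2 \Bigr] \beta(\Lambda_m) \\
&\qquad - \frac{2}{n^2} \Bigl( \frac{2C}{V-1} - 1 \Bigr) \gamma(\Lambda_m) + \frac{1}{n^3} \Bigl( \frac{2C}{V-1} - 1 \Bigr)^2 \zeta(\Lambda_m) . \tag{50}
\end{align*}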
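To see how the constants of Proposition 15 behave, the short sketch below evaluates them exactly with rational arithmetic. The function name `prop15_constants` is hypothetical, and the formulas coded are the expressions of C_B, C_C and C_D as read from the proposition and its proof; the point illustrated is that for the bias-corrected choice C = V - 1 the constants C_C and C_D vanish while C_B equals (8/n^2)(1 + 1/(V - 1)).

```python
from fractions import Fraction

def prop15_constants(n, V, C):
    """Exact values of (C_B, C_C, C_D) for sample size n, V folds and constant C."""
    C = Fraction(C)
    c_b = Fraction(8, n ** 3) * (
        C ** 2 * (n - V + 1) / Fraction((V - 1) ** 3) + 2 * C / (V - 1) + n - 1
    )
    c_c = 4 * (C - V + 1) / Fraction(n ** 2 * (V - 1))
    c_d = 4 * (C - V + 1) ** 2 / Fraction((V - 1) ** 2 * n ** 3)
    return c_b, c_c, c_d

# Bias-corrected V-fold penalization corresponds to C = V - 1.
for V in (2, 5, 10, 100):
    c_b, c_c, c_d = prop15_constants(n=100, V=V, C=V - 1)
    print(V, c_b == Fraction(8, 100 ** 2) * (1 + Fraction(1, V - 1)), c_c, c_d)
```

Running the loop for increasing V makes the 1 + 1/(V - 1) dependence visible: most of the decrease of C_B happens between V = 2 and V = 5 or 10.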



The proof of Proposition 15 relies on the following lemma, by taking $\xi_{\lambda,i} = \psi_\lambda(\xi_i) - P(\psi_\lambda)$.

Lemma 16. Let A be a discrete set, Ai,A2 C A non-empty, n > 2, a G M n (M.) symmetric, 
(P\)\£A G M A and (^x,i)\pa ' ■ ■ ■ ' (^ a >")a£A a se Q uence of independent and identically distributed 
random variables with Vi, A, E = 0. For every q,r G { 1, 2} and A, A' G A, define 

v x :=E[e x>1 ] and C x 9, y := E |"Ca,i?a',i" 



Lei us assume that X^AeA ^A < °°' SasA^a 1 



< 00. Then, if for every A a C A, 



(si) z(a«)= Y, T, E(^A, i 

l<«J<nAeA a l<i<nAeA a 

cov(Z(A 1 ),Z(A 2 )) = n £ 

AgAi , A'eA 2 



'J 



E 

AeAi , A'eA 2 



(52) 



+M E 

+ f E°m) E [/5a4 2 ?+/3a^ 

Vi=l / AeAi,A'eA 2 



E 

AeAi , A'eA 2 



E WA 

AeAi 



E ^ 

AeA 2 



Lemma 16 is proved in Section I.4.3. We can now prove Proposition 15.
Proof of Proposition 15. For all $i, j \in \{1, \ldots, n\}$ and $\lambda \in \Lambda := \Lambda_m \cup \Lambda_{m'}$, let

£ Aii = ^ x (Xi) - Ptyx) and =M[(W i -l)(W j -l)] , 

where, for alH = 1, . . . , n, Wi = (1 — n~ l n j)li^Bj is the F-fold weight vector. For every q, r > 
and A, A' G A m U A m /, let u A = u A and = We have 

£ [p(^ A )p(^v)c ? i!i ) 

AeA m ,A'eA m , 



e £ (p(Va) (Va(6) - PM)P(M (<M6) - p(^aO)) 

_AeA m , A'eA m / 

(53) = E[(s m (£i) — E [s m (£i)] ) (s m /(6) - E [s m '(£i)])] = cov P (s m , s m /) 
and remark that 

(54) var P (s m ) + var P (s m /) - 2cov P (s m , s m /) = var P (s m - s m >) . 
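Ideal penalty. For the ideal penalty, we simply notice that

\begin{align*}
\mathrm{pen}_{\mathrm{id}}(m) = 2(P_n - P)(\widehat{s}_m) &= 2 \sum_{\lambda \in \Lambda_m} \bigl[ (P_n\psi_\lambda - P\psi_\lambda) P_n(\psi_\lambda) \bigr] \\
&= \frac{2}{n^2} \sum_{1 \le i,j \le n} \sum_{\lambda \in \Lambda_m} \xi_{\lambda,i}\xi_{\lambda,j} + \frac{2}{n} \sum_{1 \le i \le n} \sum_{\lambda \in \Lambda_m} P(\psi_\lambda)\xi_{\lambda,i} . \tag{55}
\end{align*}

Therefore, $\mathrm{pen}_{\mathrm{id}}(m)$ is of the form (51) with $\Lambda_a = \Lambda_m$ and

\[ \forall i, j \in \{1, \ldots, n\}, \quad \alpha_{i,j} = \frac{2}{n^2} \qquad\text{and}\qquad \forall \lambda \in \Lambda_m, \quad \beta_\lambda = \frac{2P(\psi_\lambda)}{n} , \]

so that, by Lemma 16 and Eq. (53),

\[ \mathrm{var}\bigl(\mathrm{pen}_{\mathrm{id}}(m)\bigr) = \frac{4}{n} \mathrm{var}_P(s_m) + \frac{8(n-1)}{n^3} \beta(\Lambda_m) + \frac{4}{n^2} \gamma(\Lambda_m) + \frac{4}{n^3} \zeta(\Lambda_m) . \]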

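Proof of Eq. (47). Since

\[ \|\widehat{s}_m - s_m\|^2 = \sum_{\lambda \in \Lambda_m} \bigl( (P_n - P)(\psi_\lambda) \bigr)^2 = \frac{1}{n^2} \sum_{1 \le i,j \le n} \sum_{\lambda \in \Lambda_m} \xi_{\lambda,i}\xi_{\lambda,j} , \]

$\|\widehat{s}_m - s_m\|^2$ is of the form (51) with $\Lambda_a = \Lambda_m$ and

\[ \forall i, j \in \{1, \ldots, n\}, \quad \alpha_{i,j} = \frac{1}{n^2} \qquad\text{and}\qquad \forall \lambda \in \Lambda_m, \quad \beta_\lambda = 0 , \]

so that, by Lemma 16,

\[ \mathrm{var}\bigl( \|\widehat{s}_m - s_m\|^2 \bigr) = \frac{2(n-1)}{n^3} \beta(\Lambda_m) + \frac{1}{n^3} \zeta(\Lambda_m) . \]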



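V-fold penalty. It follows from Eq. (37) that

\[ \mathrm{pen}_{\mathrm{VF}}(m) = \frac{2C}{n^2} \sum_{1 \le i,j \le n} \sum_{\lambda \in \Lambda_m} \bigl( E_{i,j}^{(\mathrm{VF})} \xi_{\lambda,i}\xi_{\lambda,j} \bigr) , \tag{56} \]

where $E_{i,j}^{(\mathrm{VF})}$ is computed in an earlier lemma:

\[ \forall I, J \in \{1, \ldots, V\}, \ \forall i \in B_I, \ \forall j \in B_J, \quad E_{i,j}^{(\mathrm{VF})} = \frac{\mathbf{1}_{I = J}}{V-1} - \frac{\mathbf{1}_{I \neq J}}{(V-1)^2} , \]

since all blocks $B_J$ have the same size $n/V$. So,

\[ \sum_{i=1}^{n} E_{i,i}^{(\mathrm{VF})} = \frac{n}{V-1} \qquad\text{and}\qquad \sum_{i=1}^{n} \bigl( E_{i,i}^{(\mathrm{VF})} \bigr)^2 = \frac{n}{(V-1)^2} . \tag{57} \]

Using in addition that

\[ \sum_{1 \le i \le n} E_{i,i}^{(\mathrm{VF})} + \sum_{1 \le i \neq j \le n} E_{i,j}^{(\mathrm{VF})} = 0 , \tag{58} \]

Eq. (57) implies that

\[ \sum_{1 \le i \neq j \le n} E_{i,j}^{(\mathrm{VF})} = -\sum_{i=1}^{n} E_{i,i}^{(\mathrm{VF})} = \frac{-n}{V-1} . \tag{59} \]

In addition,

\[ \sum_{1 \le i \neq j \le n} \bigl( E_{i,j}^{(\mathrm{VF})} \bigr)^2 = n \Bigl( \frac{n}{V} - 1 \Bigr) \frac{1}{(V-1)^2} + n \Bigl( n - \frac{n}{V} \Bigr) \frac{1}{(V-1)^4} = \frac{n(n-V+1)}{(V-1)^3} . \tag{60} \]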

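According to Eq. (56), $\mathrm{pen}_{\mathrm{VF}}(m) = \mathrm{pen}_{\mathrm{VF}}(m; \mathcal{B}; C)$ (without assuming $C = V - 1$) is of the form
(51) with $\Lambda_a = \Lambda_m$ and

\[ \forall i, j \in \{1, \ldots, n\}, \quad \alpha_{i,j} = \frac{2C E_{i,j}^{(\mathrm{VF})}}{n^2} \qquad\text{and}\qquad \forall \lambda \in \Lambda_m, \quad \beta_\lambda = 0 . \]

From Eq. (57) and (60), we obtain that

\[ \sum_{i=1}^{n} \alpha_{i,i}^2 = \frac{4C^2}{n^3(V-1)^2} \qquad\text{and}\qquad \sum_{1 \le i \neq j \le n} \alpha_{i,j}^2 = \frac{4C^2(n-V+1)}{n^3(V-1)^3} . \]

Therefore, by Lemma 16,

\[ \mathrm{var}\bigl(\mathrm{pen}_{\mathrm{VF}}(m)\bigr) = \frac{8C^2(n-V+1)}{n^3(V-1)^3} \beta(\Lambda_m) + \frac{4C^2}{n^3(V-1)^2} \zeta(\Lambda_m) . \tag{61} \]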
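Difference of V-fold and ideal penalty. According to Eq. (55) and (56),

\[ \mathrm{pen}_{\mathrm{VF}}(m) - \mathrm{pen}_{\mathrm{id}}(m) = Z_\Delta(\Lambda_m) , \qquad \mathrm{pen}_{\mathrm{VF}}(m') - \mathrm{pen}_{\mathrm{id}}(m') = Z_\Delta(\Lambda_{m'}) , \]

where $Z_\Delta(\cdot)$ is defined by Eq. (51) with

\[ \forall i, j \in \{1, \ldots, n\}, \quad \alpha_{i,j} = \frac{2}{n^2} \bigl( C E_{i,j}^{(\mathrm{VF})} - 1 \bigr) \qquad\text{and}\qquad \forall \lambda \in \Lambda, \quad \beta_\lambda = \frac{-2P(\psi_\lambda)}{n} . \]

So, using Eq. (57), (59) and (60), we have

\begin{align*}
\sum_{i=1}^{n} \alpha_{i,i}^2 &= \frac{4}{n^4} \Bigl[ C^2 \sum_{i=1}^{n} \bigl( E_{i,i}^{(\mathrm{VF})} \bigr)^2 - 2C \sum_{i=1}^{n} E_{i,i}^{(\mathrm{VF})} + n \Bigr] = \frac{4(C-V+1)^2}{(V-1)^2 n^3} = C_D , \\
\sum_{1 \le i \neq j \le n} \alpha_{i,j}^2 &= \frac{4}{n^4} \Bigl[ C^2 \sum_{i \neq j} \bigl( E_{i,j}^{(\mathrm{VF})} \bigr)^2 - 2C \sum_{i \neq j} E_{i,j}^{(\mathrm{VF})} + n(n-1) \Bigr] \\
&= \frac{4}{n^4} \Bigl[ \frac{C^2 n(n-V+1)}{(V-1)^3} + \frac{2Cn}{V-1} + n(n-1) \Bigr] \\
&= \frac{4}{n^3} \Bigl[ \frac{C^2(n-V+1)}{(V-1)^3} + \frac{2C}{V-1} + (n-1) \Bigr] = \frac{C_B}{2} , \\
\sum_{i=1}^{n} \alpha_{i,i} &= \frac{2(C-V+1)}{n(V-1)} = \frac{n C_C}{2} .
\end{align*}

Therefore, by Lemma 16 with $\Lambda_a = \Lambda_m$, $\Lambda_b = \Lambda_{m'}$ and Eq. (53),

\[ \mathrm{cov}\bigl( Z_\Delta(\Lambda_m), Z_\Delta(\Lambda_{m'}) \bigr) = \frac{4}{n} \mathrm{cov}_P(s_m, s_{m'}) + C_B \beta(\Lambda_m, \Lambda_{m'}) - C_C \gamma(\Lambda_m, \Lambda_{m'}) + C_D \zeta(\Lambda_m, \Lambda_{m'}) , \]

which gives Eq. (49) when $m = m'$. The formula for general $m, m'$ follows then from Eq. (54) and the relation

\[ \mathrm{var}\bigl( Z_\Delta(\Lambda_m) - Z_\Delta(\Lambda_{m'}) \bigr) = \mathrm{var}\bigl( Z_\Delta(\Lambda_m) \bigr) + \mathrm{var}\bigl( Z_\Delta(\Lambda_{m'}) \bigr) - 2\,\mathrm{cov}\bigl( Z_\Delta(\Lambda_m), Z_\Delta(\Lambda_{m'}) \bigr) . \]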
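V-fold penalized criterion. Let $Z_C(\Lambda_m) = P_n\gamma(\widehat{s}_m) + \mathrm{pen}_{\mathrm{VF}}(m; \mathcal{B}; C)$. Then,

\begin{align*}
P_n\gamma(\widehat{s}_m) &= -\sum_{\lambda \in \Lambda_m} \bigl( P_n(\psi_\lambda) \bigr)^2 \\
&= -\sum_{\lambda \in \Lambda_m} \Bigl[ \bigl( (P_n - P)(\psi_\lambda) \bigr)^2 + 2 P(\psi_\lambda)(P_n - P)(\psi_\lambda) \Bigr] - \sum_{\lambda \in \Lambda_m} \bigl( P(\psi_\lambda) \bigr)^2 \\
&= -\frac{1}{n^2} \sum_{\lambda \in \Lambda_m} \sum_{1 \le i,j \le n} \xi_{\lambda,i}\xi_{\lambda,j} - \frac{2}{n} \sum_{\lambda \in \Lambda_m} \sum_{i=1}^{n} P(\psi_\lambda)\xi_{\lambda,i} - \sum_{\lambda \in \Lambda_m} \bigl( P(\psi_\lambda) \bigr)^2 .
\end{align*}

So, by Eq. (56), $Z_C(\Lambda_m) + \sum_{\lambda \in \Lambda_m} (P(\psi_\lambda))^2$ is of the form of Eq. (51) with $\Lambda_a = \Lambda_m$,

\[ \forall i, j \in \{1, \ldots, n\}, \quad \alpha_{i,j} = \frac{2C E_{i,j}^{(\mathrm{VF})} - 1}{n^2} \qquad\text{and}\qquad \forall \lambda \in \Lambda_m, \quad \beta_\lambda = \frac{-2P(\psi_\lambda)}{n} , \]

where $E_{i,j}^{(\mathrm{VF})}$ is the same V-fold weight covariance as in Eq. (56). Similarly to computations for $Z_\Delta$,

\begin{align*}
\sum_{i=1}^{n} \alpha_{i,i} &= \frac{1}{n} \Bigl( \frac{2C}{V-1} - 1 \Bigr) , \qquad
\sum_{i=1}^{n} \alpha_{i,i}^2 = \frac{1}{n^3} \Bigl( \frac{2C}{V-1} - 1 \Bigr)^2 , \\
\sum_{1 \le i \neq j \le n} \alpha_{i,j}^2 &= \frac{1}{n^3} \Bigl[ \frac{4C^2(n-V+1)}{(V-1)^3} + \frac{4C}{V-1} + n - 1 \Bigr] ,
\end{align*}

so that Lemma 16 and Eq. (53) imply

\begin{align*}
\mathrm{var}\bigl( Z_C(\Lambda_m) \bigr) &= \frac{4}{n} \mathrm{var}_P(s_m) + \frac{2}{n^3} \Bigl[ \frac{4C^2(n-V+1)}{(V-1)^3} + \frac{4C}{V-1} + n - 1 \Bigr] \beta(\Lambda_m) \\
&\qquad - \frac{2}{n^2} \Bigl( \frac{2C}{V-1} - 1 \Bigr) \gamma(\Lambda_m) + \frac{1}{n^3} \Bigl( \frac{2C}{V-1} - 1 \Bigr)^2 \zeta(\Lambda_m) ,
\end{align*}

which proves Eq. (50).

□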
Appendix I. Supplementary material



The supplementary material is organized as follows. First, results concerning hold-out penalization
are detailed in Section I.1, with the proof of the oracle inequality stated in Section 7.1
(Theorem 3) and an exact computation of the variance. Section I.2 gives expressions of the main
terms appearing in Theorem 2 independent of the basis and evaluates these terms in the particular
case of regular histograms. Section I.3 provides complements on the computational aspects stated
in Section 6. In particular, we state and analyse the basic algorithm for computing V-fold criteria
and we give the proof of Proposition 5. Then, several technical results of the main paper are
proved in Section I.4. In Section I.5, we extend the results of the paper to pseudo-regular partitions,
that is, when assumption (H5*) is replaced by (H5). Some useful probabilistic tools are recalled
in Section I.6. Finally, some simulation results are detailed in Section I.7, as a supplement to the
ones of Section 5.



I.1. Results on hold-out penalization. This section gathers the results we can prove on hold-out
penalization, similarly to the results already proved for V-fold penalization.



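I.1.1. Oracle inequality: proof of Theorem 3. First note that (H2) implies (H2g) (Lemma 14) since
we assume (H1), (H3) and (H4). So, we will prove Theorem 3 with (H2) replaced by (H2g).

Recall that for any $T \subset \{1, \ldots, n\}$, $n_t = \mathrm{Card}(T)$, $n_v = n - n_t$ and $(\psi_\lambda)_{\lambda \in \Lambda_m}$ denotes an orthonormal
basis of $S_m$. The hold-out penalty is equal to

\[ \mathrm{pen}_{\mathrm{HO}}(m) = 2C \bigl( P_n^{(T)} - P_n^{(-T)} \bigr) \bigl( \widehat{s}_m^{(T)} - \widehat{s}_m^{(-T)} \bigr) = 2C \sum_{\lambda \in \Lambda_m} \Bigl[ \bigl( (P_n^{(T)} - P_n^{(-T)}) \psi_\lambda \bigr)^2 \Bigr] . \tag{62} \]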


As for Theorem 1, the sketch of the proof of Theorem 3 is to show the conditions of Lemma 12
are satisfied with the hold-out penalty. Since we need a concentration result for $\mathrm{pen}_{\mathrm{HO}}(m)$, we start
by an exact formula for the hold-out penalty (Lemma 17, analogous to Proposition 3). Then, we
get the concentration of $\mathrm{pen}_{\mathrm{HO}}(m)$ with Lemma 18 (analogous to Proposition 2) and Lemma 11.

Let us first state and prove the two lemmas that will be necessary in the proof.



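Lemma 17. For all $m \in \mathcal{M}_n$, we have

\[ \mathrm{pen}_{\mathrm{HO}}(m) = 2C \Bigl( \bigl\| \widehat{s}_m^{(T)} - s_m \bigr\|^2 + \bigl\| \widehat{s}_m^{(-T)} - s_m \bigr\|^2 - 2 \bigl( P_n^{(T)} - P \bigr) \bigl( \widehat{s}_m^{(-T)} - s_m \bigr) \Bigr) . \]

Proof of Lemma 17. By definition, we have

\begin{align*}
\mathrm{pen}_{\mathrm{HO}}(m) &= 2C \sum_{\lambda \in \Lambda_m} \Bigl\{ \bigl( (P_n^{(T)} - P) \psi_\lambda \bigr)^2 + \bigl( (P_n^{(-T)} - P) \psi_\lambda \bigr)^2 \Bigr\}
 - 4C \sum_{\lambda \in \Lambda_m} \Bigl\{ \bigl( (P_n^{(T)} - P) \psi_\lambda \bigr) \bigl( (P_n^{(-T)} - P) \psi_\lambda \bigr) \Bigr\} \\
&= 2C \Bigl( \bigl\| \widehat{s}_m^{(T)} - s_m \bigr\|^2 + \bigl\| \widehat{s}_m^{(-T)} - s_m \bigr\|^2
 - 2 \bigl( P_n^{(T)} - P \bigr) \Bigl( \sum_{\lambda \in \Lambda_m} \bigl( (P_n^{(-T)} - P) \psi_\lambda \bigr) \psi_\lambda \Bigr) \Bigr) .
\end{align*}

□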
Lemma 18. An absolute constant $L > 0$ exists such that, for all $m \in \mathcal{M}_n$ and $x > 0$, with
probability larger than $1 - 2e^{-x}$,



(63) (PP -P)(s 



a; x 
V 



>l/4 



S<-t) 



2 /? 



Proof of Lemma \JR Let us apply Bernstein's inequality conditionally to (^i)i^T to the function 
t = (s^ T) - sm) • From Eq. ® , 



-<-T) _ 



< 



var 



(^(0-*m(o| (e<)*r) 



< 



~(-T) _ 



3<- T) - s 



' 2 ^ C+R m 



s<- T) - s 



s<- T) - s 



Hence, for all x > 0, with probability larger than 1 — 2e x , conditionally to (£i)i£T> 

CirX \ 



< 



-<-T) _ 
S m S r 



C 2 R m 



2x 



+ 



+ 



C+Rm 

rjn t 



'R* 3y/n t j 
2x C 2 x 2 



iR* 



9n t 



As the bound on the probability does not depend on the same inequality holds uncondi- 



tionally. We choose rj = (y/x (R^) 1/4 ) A (xn t 1 ^ 2 ) to conclude the proof. 

Proof of Theorem^ For all x > 0, let e„ (x) := (i?* An„ l\nt)~ x l 2 \[x. We deduce from Lemmas [TT| 
[T7] and [18] that there exist absolute constants L,L' such that, for all x < C~ 2 -i/ i?* A n v A nt , with 
probability larger than 1 — L'e~ x 



□ 



Vm e .M n , 



pen HO (m) 



2 lis 



Cn 2 (n v nt) 1 

Therefore, conditions (|C2|) of Lemma [T2l hold with 

,2 



m »m | 



R, 



<Le(P(x)^ 



n 



c = C 



n 



n v n t 



ex, m {x) = eP(x), 



y/R* A n t A n v 
C 2 



Moreover, from Lemma fHl Condition (jCip holds with the uniform probability measure ir on M n , 



y = 2 In n + ln(L 4 1 



4 7 j 



ln(-7r(m) x ) + y. From Lemma [T2l we deduce that there exist a constant 



L such that, with probability larger than 1 — n , Vm G A4 n , 
l+{2C£^-l)-L(c£^Vl)e£\lnn) 



Sri 



ii2 < ii- ii2 



□ 



1.1.2. Variance. 



Proposition 19. Assume that card(T) = m G { 1, . . . , n — 1 }. Then, with the notations introduced 
in Proposition 1 151 for every m,m' £ -M n; 



(64) var( pen HO (m)) 



4C 2 



ir. 



n 

--1 ) +1 



C(A m ) 



+ 



8C 2 n 



( -3n 2 + n(3n 4 - n 2 ) + n 2 (n t - 1) ) (3 ( A m ) , 



(65) var ( pen HO (m) — pen id (m)) = — varp (s r 



+ 8 


r c 2 


n3n t 3 


8 


/ Cn 


n 





( — 3n 2 + n(3n t — ra 2 ) + n 2 (n 4 — 1) ) 



- 7 (A m ) + 4 



2C n-1 
+ 



nn v nt n c 



' c 2 




2C 1 " 




+ — 

nn v nt n 6 







C(A m ) , 



and for every m, m! 6 M n , 

4 

(66) var ((pen HO (m) - pen id (m)) - (pen HO (m') - pen id (m'))) = - var P (s m - s 
+ 8 [4^r (-3n 2 + n(3n t -n 2 )+n 2 (n t - 1)) ^— + ^E_l 

7 *3 Tl 



nn v nt r 



8 / Cn 1 



n v n^ni n 



C(A m ,A m /) + 4 



' c 2 




2C 1 " 




+ — 






nn v nt n J 



B ( A m , A m / ) 

D(A m ,A r , 



Proof of Proposition llffl By definition we have n v Pn = nP n — ntPn , i.e 

>CT) 



P {T) _ p 



p(T) _ p(-T) p{T) _ nP n ~ n t Pn 

Therefore, we have 

pen HO (m) = 2C(Pf ) - P^W^ ~ 4" T) ) = 2C £ [ (pCO - pHH ) ^ 



AeA r , 



2C 



2C 



n 
n v 

n 



7 AeA. 



E [(^ r) - ■ 

AeA m 



(67) 



2C7 i E fE(£i te T-i)(^)-p^) 

" AeA m \ i=i v 
i n 

*4 £ E 



' AeA m ij=i 

In the previous inequality, we wrote, for all i, j G { 1 . . . , n} and A £ A r 



^i^ATO-WA) and 4f 0) " 



1 



n 
the hold-out weight vector. We have

2 



E 

Therefore, 
(68) 

(69) 



(HO) 



n 



ii 



1 ) 1 »,ieT + I 1 ) lieT,j$T + I 1 ) IjgT, jeT + li,j£T 



n 



E E . 



(HO) 



i=l 



n 



i=i 



«t 1 + (n - rtf) 



n(n — nt) 



n t 



n t ( — - 1 ) + (n - n t ) = (n - n t ) 



n 

--1 I +1 

n t 



Eq. ([58]) also holds for the hold-out weight-vector, i.e. 



(70) 

so Eq. ([68]) implies that 
(71) 

In addition, 



E K 



(HO) 



E 



l<i,j<n 



E 

v i=l 



n 



E K°') 



l<i^:j<ra 



E« 



(ho) _ -n(n - nt) 



i=i 



£ (^)=2n t( n-n t ) 

l<i^:j<n 



2 

1) + 



n t (n t - 1) f^- - l\ + {n - n t ){n - n t - I) 



(n - n t ) 3 ( {n t - l)(n - n t ) n t (n - n t - 1) 



»7 



(n - n 4 ) s 



(72) 



3 "') (_ 3n 2 +n(3nf _ n 2 )+n 2 (nf _ 1)) 



According to ([67]) . pen HO ("i<) is of the form ([5T]) with 

2CE (HO) 

Vz,j G {!,..., ra}, «ij = — and VA G A m , /3 A = 



n- 



From Eq. (|69 [) and ([72p . we obtain that 



E 



4C 2 



1,1 9 



n 

1 I +1 



and 



4C 2 n 
'' n^n 3 



( -3ra 2 + n(3n 4 - nf ) + n 2 (n t - 1) ) 



l<i^j<n 

Therefore, by Lemma [16] applied with Ai = A2 = A m , 



var(pen HO (m)) 



(73) 



4C 2 



rr. 



n 
n t 



1 +1 



C(Ar. 



+ 



8C 2 n 



( -3n 2 + n(3n 4 - n 2 ) + n 2 (n t - 1) ) ( A m ) 
According to Eq. (67) and (55), $\mathrm{pen}_{\mathrm{HO}}(m) - \mathrm{pen}_{\mathrm{id}}(m)$ is of the form (51) with



Vi,j G {!,..., n}, ay = 2 C- 



p(HO) 



1 



n 2 , n 2 



and VA G A m , /?> 



-2P(Va) 



So, using Eq. ([68]), (EH), (EO and ([72|), we have 



8=1 



(74) 



n 



u 8=1 
2 

3~ 



2C 

n 2 n 2 .__ s 



+i 



2C 



nn v nt n 



+ - 



E a l = 4 

1< ij^j<n 



(75) 
(76) 



C 2 / hi 

c 2 



iliO)\ 2 -C 



(HO) n-l 



n 2 n 2 



n 



2C n-l 



n l n t 



3n 2 + n(3n t - n\) + n 2 (n t - 1)) + + 



nn v nt n° 



A / Cn 1 \ 
Va^ = 2 

8=1 x ' 



Therefore, by Lemma [16] with Ai = A2 = A m and by Eq. (|53p . we deduce 



var (pen H0 (m) — pen id (m) ) = 4 
C 2 



c 2 



I?.', 1 



^] +1 



+ 8 



( -3n 2 + n(3n t - n 2 ) + n 2 (n t - 1) ) 



2C 1 

+ — 

nn v nt n A 

2C n-l 

+ TT- 



nn v n t n° 



C(A„ 



8 f Cn 1 \ . . . 4 . . 

7(A m ) + -var P (s m ) . 

n \ n v nt n J n 

Let us now remark that 

pen HO (m) - pen id (m) - pen HO (m') - pen id (m') = Z(A m ) - Z(A r 

where, for all a G {m,m'}, Z(A a ) is defined by Eq. ([ST]) with 



Vi,j G { 1, . . . , n} , Oij = 2 1 C- 



E 



(HO) 



"7 



1 



and VA G A , /?a 



-2P(V> 



r? 



It comes therefore from Lemma [TBI and Eq. ([71]) . ([75]) and ([76]) that, for all a, 6 G (m,m') 



f\2 



cov(Z(A a ),Z(A b )) = 4 



"c 2 


W + 1 









2C 1 
+ 



8 


r c 2 






8 


/ Cn 


n 


\n v n t 



nn v nt n 
( -3n 2 + n(3n t - n 2 ) + n 2 (n t - 1) ) 



C(A a ,A b ) 



2C n-l 
+ 



nn v nt n° 



P (A a , A 6 ) 



- 7(A a ,A 6 ) + -cov P (A a ,A 6 ) . 
n 



Eq. (j66p follows then from Eq. f)54[) and the relation 

var ( Z(A m ) - Z(A m ,)) = cov ( Z(A m ), Z(A m ) ) - 2 cov ( Z(A m ), Z(A m , ) ) + cov ( Z(A m ,), Z(A r 
I.2. More results on the variance. Theorem 2 and Proposition 15 provide general exact formulas
that can seem a bit too abstract. This section provides tools for understanding the terms appearing
in these results.

Evaluation of the terms in the variance term. First, we give a formulation of the terms appearing in 
Theorem [2] and Proposition [15] that do not depend of the basis (ipx) xeA m ■ We will use the notation 



of Theorem [21 Proposition [T5l and Section Recall that by Corollary [6[ $A m := J2x<=A ^ 
su PteB m t2 , T A m : = EagA™ £a,1 = ~ 2s m + \\s m \\ 2 are independent of the basis (^x)xeA- 

Proposition 20. For any m,m' G M n , we have 



(77) 

(78) 
(79) 
(80) 
(81) 



/3(A m ,A m ,) 



sup n A (ts] 



2-P \_S m S m i ] + 1 1 >Sm 1 1 



2 

m' || j 



7(A m ,A m /) = P(T Am (s m/ -Ps m >) + T Am ,(s m -Ps m )) 

= COVp ( * Am ,S m >) + COVp ( ^A m , , s m ) - 4 COVp ( s m , s m / ) . 

C (A m , A m , ) = covp (T Am ,T Am , ) = covp (^ Am - 2s m , ^ Am , - 2s m / ) , 
D ( A m , A m > ) = varp (T Am - T Am , ) = var P ( * Am - * Am , - 2(s m - s m >) ) 
C (A m , A m /) = 2 covp (^ Am - * Am ,,s m - s m >) - 4 varp (s m - s m /) . 

Proof of Proposition \2b\ The terms £(A a ,Ab): A direct computation shows that 



C (A m ,A m > ) :- 



E £«?-[ E-*)( E-» 

AeA m , A'eA„/ \AeAm / V AeA , 



E i E I - | E | E & 

AeA m ,A'eA, n , / \ \ AeA; 




e I E &,i 

A'eA. 



= cov £ £a,i, E Cam = covp (r Am ,r Am/ ). 

\AeA m A'eA m / / 
The result for D (A m ,A m / ) follows then immediately from the definition 

D (A m ,A m i ) = ( (A m ,A m ) + ( (A m ',A m i ) — 2( (A m ,A m i ) 
The terms 7(A m ,A m /): By definition, we have 



7 ( A m , A m , ) := J2 [fa^S? + P* C ix 
AeA m ,A'eA m / 

Y, /3a'E ((ip x - Pil>x?(i)\i ~ PM) + /9aE (U>x> - P^A') 2 (V-A - P^a)) 



AeA m , A'eA m , 

P (T Am (s m > - Ps m >) + T Am ,(s m - Ps m )) . 
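We have then, by definition

\begin{align*}
C(\Lambda_m, \Lambda_{m'}) &:= \gamma(\Lambda_m, \Lambda_m) + \gamma(\Lambda_{m'}, \Lambda_{m'}) - 2\gamma(\Lambda_m, \Lambda_{m'}) \\
&= 2P\bigl( T_{\Lambda_m}(s_m - Ps_m) + T_{\Lambda_{m'}}(s_{m'} - Ps_{m'}) \bigr) \\
&\qquad - 2P\bigl( T_{\Lambda_m}(s_{m'} - Ps_{m'}) + T_{\Lambda_{m'}}(s_m - Ps_m) \bigr) \\
&= 2P\bigl( T_{\Lambda_m}\bigl( (s_m - s_{m'}) - P(s_m - s_{m'}) \bigr) \bigr) \\
&\qquad + 2P\bigl( T_{\Lambda_{m'}}\bigl( (s_{m'} - s_m) - P(s_{m'} - s_m) \bigr) \bigr) \\
&= 2P\bigl( (T_{\Lambda_m} - T_{\Lambda_{m'}})(s_m - s_{m'} - P(s_m - s_{m'})) \bigr) .
\end{align*}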

The terms /3(A m ,A m /): By definition, we have 

/3(A m ,A m ,):= J] (^V?)^ E ™*Wx,M 2 = E (P(Mx')-P^xP^ 
AeA m ,A'eA m / AeAm AeA ro 

A'GA m , A'eA m , 

= ^ (p(Va^aO) 2 -2 E p1>\P*I>\>p W\M + E (PTp\Pip\>) 2 

AGA m AeA m AeA m 

A'eA - A'eA, A'eA , 



£ (P(VaV'aO) 2 -2P[ + 



■ 1 2 
m' II 



AeA„ 
A'eA„ 



By definition of H A , , using Lemma 

(p(VaV>v)) 2 = E \\^A m ,(M\\ 2 = I E (n Am ,(^0) 2 ^ 

AeA m AeA m AeA m 

A'eA m , 

= / sup n A V oaV'as a>= / sup (n A (ts)) 2 d^i . □ 

□ 

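Evaluation of the variance in the regular histogram case. In this section, we fix some integers
$d_m, d_{m'} \ge 1$ and, for $a \in \{m, m'\}$, consider the model $S_a$ of regular histograms on $\mathbb{R}$ with step size
$d_a^{-1}$. In other words,

\[ \Lambda_a := d_a^{-1}\mathbb{Z} \qquad\text{and}\qquad \forall \lambda \in \Lambda_a, \quad I_\lambda = [\lambda, \lambda + d_a^{-1}) , \quad \psi_\lambda = \sqrt{d_a}\,\mathbf{1}_{I_\lambda} . \]

We also introduce the linear span $S_*$ of $S_m \cup S_{m'}$, which is the set of histograms on the partition
$(I_\lambda \cap I_{\lambda'})_{\lambda \in \Lambda_m, \lambda' \in \Lambda_{m'}}$ of $\mathbb{R}$. We denote by $d_*$ the dimension of $S_*$. We define, for all $a \in \{*, m, m'\}$,
the orthogonal projection $\Pi_a$ onto $S_a$ and $s_a := \Pi_a(s)$. In addition to the general properties of
least-squares density estimation, regular histograms satisfy the following: $S_a$ is stable by product,

\[ \forall x \in \mathbb{R}, \ \forall k \in \mathbb{N}, \ \forall a \in \{m, m'\}, \quad \sum_{\lambda \in \Lambda_a} \psi_\lambda^k(x) = d_a^{k/2} , \tag{82} \]

\[ \forall a \in \{m, m'\}, \ \forall \lambda, \lambda' \in \Lambda_a, \quad \psi_\lambda\psi_{\lambda'} = \sqrt{d_a}\,\mathbf{1}_{\lambda = \lambda'}\,\psi_\lambda , \tag{83} \]

\[ \forall a \in \{*, m, m'\}, \ \forall t \in S_a, \ \forall f \in L^2(\mu), \quad \Pi_a(tf) = t\,\Pi_a(f) . \tag{84} \]

Proof of (84). For any $t \in S_a$,

\[ tf = t\,\Pi_a(f) + t\bigl( f - \Pi_a(f) \bigr) , \]

with $t\,\Pi_a(f) \in S_a$ since $S_a$ is stable by product, and $t(f - \Pi_a(f))$ is orthogonal to $S_a$ since

\[ \forall u \in S_a, \quad \bigl\langle t(f - \Pi_a(f)), u \bigr\rangle_{L^2(\mu)} = \bigl\langle f - \Pi_a(f), tu \bigr\rangle_{L^2(\mu)} = 0 \]

since $tu \in S_a$ (using again that $S_a$ is stable by product). □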
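Property (84), that projecting onto histograms commutes with multiplication by a histogram, can be illustrated on a discretized example. The sketch below rests on assumptions that are not in the paper: functions are represented by their values on a uniform grid of [0, 1), the reference measure is the uniform measure on that grid, and `project` is a hypothetical helper that averages over each of the d bins.

```python
def project(values, d):
    """Orthogonal projection onto regular histograms with d bins, for a function
    given by its values on a uniform grid whose size is a multiple of d."""
    per_bin = len(values) // d
    out = []
    for k in range(d):
        chunk = values[k * per_bin:(k + 1) * per_bin]
        out.extend([sum(chunk) / per_bin] * per_bin)   # bin average, repeated on the bin
    return out

grid_size, d = 12, 3
f = [float(i * i) for i in range(grid_size)]           # an arbitrary function f
t = [2.0] * 4 + [-1.0] * 4 + [5.0] * 4                 # a histogram t with d = 3 bins

tf = [ti * fi for ti, fi in zip(t, f)]
left = project(tf, d)                                  # Pi_a(t f)
right = [ti * pi for ti, pi in zip(t, project(f, d))]  # t Pi_a(f)
print(left == right)
```

Equality holds because t is constant on each bin and can be pulled out of the bin average; for a function t that is not a histogram on the same partition, the two sides generally differ.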

In particular, from Eq. (f82|) . we have Va G {m, m'} , = d a is constant. We will also use that 
in general, 

A6A n 

The following proposition gives the orders of magnitude of the terms involved in Proposition [20] 
for such regular histogram models. 

Proposition 21. For all a, b G {m,m'}, we have 

C(A a ,Ab) = -7(A a ,A fe ) = 4cov P (s a ,s 6 ) 

(85) so that D (A m , A m ' ) = -C (A m , A m / ) = 4var P (s m - s m >) . 

Moreover, assume a constant L > exists such that 

(86) VA G A m , VA' G A m , , I A n iy IT 1 ^ 1 < MA n Jy ) < Ld" 1 . 
Then, we have 

A II 1 1 2 _. t II ||2 r dmd m r ,. ||2 ^ t> / a a \ \ n r> ( ( -v2 "\ 1 1 1 1 2 1 1 | 

^m 1 1 1 1 ~r "m' 1 1 ^m' II ; 1 1 S* II — V i **m' ) ' I S m > ) I I 1 1 S m 1 1 1 1 S m i \ 



(87) < d m \\sm\\ 2 + d m i ||s m '|| 2 — 2L 



i d m d m i 



d* 



Remark 4. Eq. (1871) is of particular interest when 5 m C since then Eq. (1861) holds with L = 1 
and <i* = d m /, so that 

B ( A m , A m / ) — d m || S m || + (dm' 2dm) \\Sm' || 2P ^ ( S m S m ' ) ^ + ^ || S m || || Sm' || 

— dm || || "I - (d m ' 2d m ) \\s m ' || varp(s m s m ' ) P ^ ( % ) ^ 

We have 

< varp(s m - Sm') + P ( (% - s m 'f < 2 ||s|| 2 . 



Therefore, when c? m / is large, the main term in B(A m ,A m /) is given by cf m ||s m || + (d n 

2dm) || Sm' || ■ 

the first case 



2d m ) ||sm' || 2 - Moreover, d m is an integer dividing d m ', hence, d m = d m '/2 or d m < <im/3. In 



dm ll^mll ~i~ i,d m ' 2d m ) ||Sm'|| — ^ dm' i 



in the second case 

2 



-dm' — dm H^mll (dm' 2d m ) ||^m'|| — d m ' ( ||<5m'|| ll^mll ) — 2d m ' W^r 



3 

In particular, assuming d m ' is large enough and || s m / 1| > c\\s\\ > for some constant c, we get that 

-j- < B ( A m , A m ' ) < Ld m ' 

for some positive constant L. 
Proof of Proposition 21. Eq. (85) follows from Proposition 20 and the fact that $\Psi_{\Lambda_m}$ is constant
for histograms. We now prove Eq. (87). Recall that

2 



sup (n Am (is)) 

If m = m', then 



(PVaV'A') 2 = d m d m Y, [PihDly)} 2 

AeA m ,A'eA ro , AeA m ,A'eA m / 



sup (n Am (ts)) 



If m ^ m', then 



E 



AeA m ,A'eA m , AeA m ,A'GA m /;/ A n/ A /^0 
It follows then from the regularity condition (186j) that 



[P(hni x ,)]< 



L 



-i d m d ri 



I l|2 ^ 



sup (n Am , (^)) 



- T d m d m > ,, ,.2 
< L sJ 



□ 



Sharpness of our concentration inequality. This section shows how our concentration result for the
V-fold penalty (Proposition 4) rewrites in the case of regular histogram models, so that we can compare
the deviation bounds with the variance computations (Propositions 15 and 21).



Proposition 22. Let S m be the space of regular histograms introduced in Section \I. 21 with 1 < 
$d_m \le n$, let $s_m$ be the projection of $s$ onto $S_m$ and let $\hat{s}_m$ be the projection estimator on $S_m$.
Assume that $\|s\|_\infty \le B_\star < \infty$. Then, some constant $c = c(B_\star)$ exists such that the following holds:

For all x < dU" 2 A (n/ ^/dm) 1 ^ , 

1 \ V4 X s 



P 



d 



< C 



d m \fx 



v 



> 1 - 2e~ 



Let U m , Uft^m be the U -statistics defined in Section\A\ on S m with the basis defined in Section\L 
For all x < dm" 2 A {n/ \fdm) 1 ^ , we have 



P UU m \ < c- 
For allx< \fd~m~ A (n/V) 1 / 4 , 

P ( \U B .m\ < C 



dm.\/X 



V 



V I - 

n 



1/4' 



dryiX 



n 



V d m V 



V 



nV 



1/4' 



> 1 - 2e~ 



> 1 - 2e~ 



Remark 5. We get that, for $d_m \le n^{1/2}$, the deviations of the V-fold penalty are of order $\sqrt{d_m x}/n$
and that the dependence in $V$ of the first-order term is proportional to $1 + V^{-1/2}$. This is what is
expected by our computations of the variance, so our concentration inequalities are sharp in this
case. When $d_m > n^{1/2}$, the main term in the concentration inequality is not sharp anymore.

Proof of Proposition By Lerll| (see Proposition [28]) . there exists an absolute constant c such 
that, for all e 6 (0, 1], with probability larger than 1 — 2e~ x , 



'm °m | 



n 



< c ^e- 

13 



d„ 



v^x ^ Am x 2 



n 



e 3 n 2 



where 

v m := sup / t 2 sdfj, < and ^A m = d m . 

If d m < n 1//2 , we choose then e = s/x/d m ; if d m > n 1 / 2 , we choose e = y^l/n) 1 / 4 to conclude. 
From Lemma [H 

\ \ ny/V enVV e A n z J J 

Choose e = x/\fd m ~ if Vd m < n and e = xiV/n) 1 ^ if > n to conclude the proof. □ 

1.3. Complements on computations questions. This section gathers the proofs of the statements
made in Section 6. First, we state more precisely the naive algorithm briefly discussed there
and we prove its complexity. Then, we prove Proposition 2.

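1.3.1. Naive algorithm.
Algorithm 2.

Input: $\mathcal{B}$ some partition of $\{1, \dots, n\}$ satisfying (H5*), $\xi_1, \dots, \xi_n \in \mathbb{X}$ and $(\psi_\lambda)_{\lambda \in \Lambda_m}$ a
finite orthonormal family of $L^2(\mu)$, with $\mathrm{Card}(\Lambda_m) = d_m$.

(1) For $j \in \{1, \dots, V\}$:

(a) train $\hat{s}_m(\cdot)$ with the data set $(\xi_i)_{i \notin B_j}$, that is, for all $\lambda \in \Lambda_m$, compute
$\alpha_{\lambda,j} := P_n^{(-B_j)}(\psi_\lambda) = \frac{V}{n(V-1)} \sum_{i \notin B_j} \psi_\lambda(\xi_i)$ so that $\hat{s}_m^{(-B_j)} = \sum_{\lambda \in \Lambda_m} \alpha_{\lambda,j} \psi_\lambda$

(b) compute the squared norm of $\hat{s}_m^{(-B_j)}$: $N_j := \sum_{\lambda \in \Lambda_m} \alpha_{\lambda,j}^2$

(c) compute $Q_j := P_n^{(B_j)}(\hat{s}_m^{(-B_j)}) = \frac{V}{n} \sum_{\lambda \in \Lambda_m} \sum_{i \in B_j} \alpha_{\lambda,j} \psi_\lambda(\xi_i)$

(d) compute $R_j := P_n^{(-B_j)}(\hat{s}_m^{(-B_j)}) = \frac{V}{n(V-1)} \sum_{\lambda \in \Lambda_m} \sum_{i \notin B_j} \alpha_{\lambda,j} \psi_\lambda(\xi_i)$

(2) Compute the V-fold cross-validation criterion: $\mathcal{C} = V^{-1} \sum_{j=1}^{V} (N_j - 2 Q_j)$

(3) Empirical risk:

(a) train $\hat{s}_m(\cdot)$ with the data set $(\xi_i)_{1 \le i \le n}$, that is, for all $\lambda \in \Lambda_m$, compute
$\alpha_\lambda := P_n(\psi_\lambda) = \frac{1}{n} \sum_{i=1}^{n} \psi_\lambda(\xi_i)$ so that $\hat{s}_m = \sum_{\lambda \in \Lambda_m} \alpha_\lambda \psi_\lambda$

(b) compute the squared norm of $\hat{s}_m$: $N := \sum_{\lambda \in \Lambda_m} \alpha_\lambda^2$

(c) compute $R := \frac{1}{n} \sum_{\lambda \in \Lambda_m} \sum_{i=1}^{n} \alpha_\lambda \psi_\lambda(\xi_i)$

(4) Compute the V-fold penalty: $\mathcal{D} := 2(V-1)V^{-2} \sum_{j=1}^{V} (R_j - Q_j)$

Output:

Empirical risk: $N - 2R$
V-fold cross-validation estimator of the risk of $\hat{s}_m$: $\mathrm{crit}_{\mathrm{VFCV}}(m) = \mathcal{C}$
V-fold penalty: $\mathrm{pen}_{\mathrm{VF}}(m) = \mathcal{D}$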

Assuming the computational cost of evaluating $\psi_\lambda$ at some point $\xi \in \mathbb{X}$ is of order 1, the
computational cost of this naive Algorithm 2 is as follows: $n(V-1)d_m$ for step 1, $V$ for steps 2
and 4, $nd_m$ for step 3. So the overall cost of computing the V-fold penalization criterion for $m$ is
of order $nVd_m$.
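The steps of the naive algorithm can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the regular histogram model with `d_m` bins on $[0, 1)$ (so that $\psi_\lambda = \sqrt{d_m}\, \mathbf{1}_\lambda$), blocks made of contiguous indices of equal size, and the sign convention $R_j - Q_j$ in the penalty; the function name `naive_vfold` is hypothetical.

```python
import numpy as np

def naive_vfold(xi, d_m, V):
    """Naive V-fold computations for regular histograms on [0, 1).

    Assumptions (not from the paper): contiguous blocks of equal size n / V,
    basis psi_lam = sqrt(d_m) * 1{x in bin lam}.
    Returns (empirical risk, V-fold CV criterion, V-fold penalty with C = V - 1).
    """
    n = len(xi)
    assert n % V == 0, "(H5*) requires blocks of equal size"
    # psi[i, lam] = psi_lam(xi_i)
    bins = np.minimum((xi * d_m).astype(int), d_m - 1)
    psi = np.zeros((n, d_m))
    psi[np.arange(n), bins] = np.sqrt(d_m)
    blocks = np.arange(n).reshape(V, n // V)

    N = np.empty(V)
    Q = np.empty(V)
    R = np.empty(V)
    for j in range(V):                                   # step 1
        test = blocks[j]
        train = np.setdiff1d(np.arange(n), test)
        alpha_j = psi[train].mean(axis=0)                # (a) coefficients
        N[j] = np.sum(alpha_j ** 2)                      # (b) squared norm
        Q[j] = (psi[test] @ alpha_j).mean()              # (c) validation block
        R[j] = (psi[train] @ alpha_j).mean()             # (d) training blocks
    crit_vfcv = np.mean(N - 2 * Q)                       # step 2
    alpha = psi.mean(axis=0)                             # step 3
    emp_risk = np.sum(alpha ** 2) - 2 * (psi @ alpha).mean()
    pen_vf = 2 * (V - 1) / V ** 2 * np.sum(R - Q)        # step 4
    return emp_risk, crit_vfcv, pen_vf

rng = np.random.default_rng(0)
emp_risk, crit_vfcv, pen_vf = naive_vfold(rng.random(60), d_m=4, V=5)
```

With equal block sizes, direct algebra gives `crit_vfcv = emp_risk + (V - 1/2) / (V - 1) * pen_vf`, which is a convenient sanity check for any implementation.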

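1.3.2. Proof of Proposition 2. Let us first note that for every $i \in \{1, \dots, V\}$ and $\lambda \in \Lambda_m$, $A_{i,\lambda} = P_n^{(B_i)}(\psi_\lambda)$. So, at step 2, for every $i, j \in \{1, \dots, V\}$,

$$C_{i,j} = \sum_{\lambda \in \Lambda_m} P_n^{(B_i)}(\psi_\lambda)\, P_n^{(B_j)}(\psi_\lambda) = P_n^{(B_i)}\Big( \sum_{\lambda \in \Lambda_m} P_n^{(B_j)}(\psi_\lambda)\, \psi_\lambda \Big) = P_n^{(B_i)}\big(\hat{s}_m^{(B_j)}\big)$$

and by symmetry $C_{i,j} = C_{j,i} = P_n^{(B_j)}\big(\hat{s}_m^{(B_i)}\big)$.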

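Correctness of Algorithm 1. By assumption (H5*), we have

$$P_n = \frac{1}{V} \sum_{j=1}^{V} P_n^{(B_j)} .$$

Therefore, for every $i \in \{1, \dots, V\}$,

$$P_n^{(-B_i)} = \frac{1}{V-1} \sum_{1 \le j \le V,\ j \ne i} P_n^{(B_j)} \quad \text{and} \quad \hat{s}_m^{(-B_i)} = \frac{1}{V-1} \sum_{1 \le j \le V,\ j \ne i} \hat{s}_m^{(B_j)}$$

and, with $T := \sum_{i=1}^{V} C_{i,i}$ and $S := \sum_{1 \le i,j \le V} C_{i,j}$,

$$\begin{aligned}
\mathrm{crit}_{\mathrm{VFCV}}(m) &= \frac{1}{V} \sum_{j=1}^{V} \Big[ \big\| \hat{s}_m^{(-B_j)} \big\|^2 - 2 P_n^{(B_j)}\big( \hat{s}_m^{(-B_j)} \big) \Big] \\
&= \frac{1}{V(V-1)^2} \sum_{j=1}^{V} \sum_{i, \ell \ne j} C_{i,\ell} - \frac{2}{V(V-1)} \sum_{1 \le i \ne j \le V} C_{i,j} \\
&= \frac{1}{V(V-1)^2} \Big[ (V-1) \sum_{i=1}^{V} C_{i,i} + (V-2) \sum_{1 \le i \ne \ell \le V} C_{i,\ell} \Big] - \frac{2}{V(V-1)} \sum_{1 \le i \ne j \le V} C_{i,j} \\
&= \frac{T}{V(V-1)} + \Big( \frac{V-2}{V(V-1)^2} - \frac{2}{V(V-1)} \Big) (S - T) \\
&= \frac{T}{V(V-1)} - \frac{S - T}{(V-1)^2} ,
\end{aligned}$$

so the formula for $\mathrm{crit}_{\mathrm{VFCV}}$ is correct. Lemma 1 implies the formula for $\mathrm{pen}_{\mathrm{VF}}$ is also correct.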
Computational cost of Algorithm 1. Step 1 has a cost of order $V \times d_m \times (n/V) = nd_m$. Step 2 has
a cost of order $V^2 d_m$. Step 3 has a cost of order $V^2$. Summing the three steps yields the result.

Computational cost for histograms. In the histogram case, step 1 can be performed with a cost of
order $Vd_m + n$. Indeed, one can initialize the $V \times d_m$ matrix $A$ with zeros (cost: $Vd_m$), and then
go sequentially through the data set: for $j = 1, \dots, n$, find the unique $i(j) \in \{1, \dots, V\}$ such that
$j \in B_{i(j)}$, the unique $\lambda(j) \in \Lambda_m$ such that $\xi_j \in \lambda(j)$, and add $(V/n)\, \psi_{\lambda(j)}(\xi_j)$ to $A_{i(j),\lambda(j)}$. Since the
partitions $\mathcal{B}$ and $\Lambda_m$ can be coded so that finding $i(j)$ and $\lambda(j)$ has a cost of order 1, the resulting
cost of step 1 is $Vd_m + n$, hence the overall cost is of order $V^2 d_m + n$. □
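The histogram version of the fast computation can be sketched as follows. This is a hedged illustration, not the paper's Algorithm 1 verbatim: the contiguous equal-size blocks, the function name `fast_vfold_histogram`, and the closed form used for the penalty (obtained by direct algebra from the block Gram matrix) are assumptions of this sketch. Only the expression of the V-fold criterion in terms of $S$ and $T$ comes from the correctness argument above.

```python
import numpy as np

def fast_vfold_histogram(xi, d_m, V):
    """V-fold criteria for regular histograms on [0, 1) from the V x d_m
    matrix A[i, lam] = P_n^{(B_i)}(psi_lam).

    Assumptions (not from the paper): contiguous blocks of equal size n / V;
    the penalty formula below is derived by hand for C = V - 1.
    """
    n = len(xi)
    assert n % V == 0, "(H5*) requires blocks of equal size"
    A = np.zeros((V, d_m))                                  # cost V * d_m
    block_of = np.arange(n) // (n // V)                     # i(j)
    bin_of = np.minimum((xi * d_m).astype(int), d_m - 1)    # lambda(j)
    # One pass over the data: add (V / n) * psi_{lambda(j)}(xi_j) to A[i(j), lambda(j)].
    np.add.at(A, (block_of, bin_of), (V / n) * np.sqrt(d_m))
    C = A @ A.T                                             # cost V^2 * d_m
    T = np.trace(C)                                         # sum of C[i, i]
    S = C.sum()                                             # sum of all C[i, j]
    crit_vfcv = T / (V * (V - 1)) - (S - T) / (V - 1) ** 2
    emp_risk = -S / V ** 2                                  # equals -||s_hat_m||^2
    pen_vf = 2 * (V * T - S) / (V ** 2 * (V - 1))
    return emp_risk, crit_vfcv, pen_vf

rng = np.random.default_rng(1)
sample = rng.random(40)
emp_risk, crit_vfcv, pen_vf = fast_vfold_histogram(sample, d_m=5, V=4)
```

The matrix product `A @ A.T` is the only step whose cost depends on both $V$ and $d_m$, which matches the $V^2 d_m + n$ order stated above.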



1.4. Proofs of technical results of the main paper. 
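1.4.1. Proof of Lemma\^ Let us first recall that, from (H5*), for all $K \in \{1, \dots, V\}$, we have

$$\frac{n_K}{n} = \frac{1}{V} , \quad \text{hence} \quad \frac{1}{1 - n_K/n} = \frac{1}{1 - V^{-1}} = \frac{V}{V-1} .$$

It follows that

$$E(W_i) = \frac{1}{V} \sum_{1 \le K \le V} \frac{1}{1 - n_K/n}\, \mathbf{1}_{i \notin B_K} = 1 .$$

When $K_i = K_j$, we have

$$E(W_i W_j) = \frac{1}{V} \sum_{1 \le K \le V} \Big( 1 - \frac{n_K}{n} \Big)^{-2} \mathbf{1}_{i \notin B_K} = \frac{V}{V-1} = 1 + \frac{1}{V-1} .$$

When $K_i \ne K_j$, we have

$$E(W_i W_j) = \frac{1}{V} \sum_{1 \le K \le V} \Big( 1 - \frac{n_K}{n} \Big)^{-2} \mathbf{1}_{i \notin B_K}\, \mathbf{1}_{j \notin B_K} = \frac{V(V-2)}{(V-1)^2} = 1 - \frac{1}{(V-1)^2} .$$

Hence,

$$E\big( (W_i - 1)(W_j - 1) \big) = E(W_i W_j) - E(W_i) - E(W_j) + 1 = \frac{\mathbf{1}_{K_i = K_j}}{V-1} - \frac{\mathbf{1}_{K_i \ne K_j}}{(V-1)^2} .$$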
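The two values of $E((W_i - 1)(W_j - 1))$ can be checked by exact enumeration. The sketch below assumes the weights take the form $W_i = (1 - n_J/n)^{-1} \mathbf{1}_{i \notin B_J}$ with $J$ uniform on $\{1, \dots, V\}$ and equal block sizes; this form is inferred from the computation above and should be treated as an assumption, and the function name `weight_covariance` is hypothetical.

```python
from fractions import Fraction

def weight_covariance(V, same_block):
    """Exact value of E[(W_i - 1)(W_j - 1)] under equal block sizes.

    Assumption (not from the paper's statement): W_i = V / (V - 1) when the
    left-out block J differs from the block of i, and W_i = 0 otherwise,
    with J uniform on {0, ..., V - 1}.
    """
    block_i = 0
    block_j = 0 if same_block else 1
    w = Fraction(V, V - 1)
    total = Fraction(0)
    for left_out in range(V):                      # enumerate J
        w_i = w if left_out != block_i else Fraction(0)
        w_j = w if left_out != block_j else Fraction(0)
        total += (w_i - 1) * (w_j - 1)
    return total / V

same = weight_covariance(5, same_block=True)       # expected 1 / (V - 1)
different = weight_covariance(5, same_block=False) # expected -1 / (V - 1)^2
```

Using exact rational arithmetic avoids any floating-point tolerance in the comparison with $1/(V-1)$ and $-1/(V-1)^2$.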



1.4.2. Proof of Lemma [73i From Lemma 6.8 in Lerl2bl |. for all m, m' G .M n and all u, 77 > 0, with 
probability larger than 1 — e _M , 



2 ™W+« 2 *Lm'/(9n) 



r» " * lit 1 

2 rjn 



For all x < y, (H2g) ensures that 



2 ^ Rm V Pro' / , \ ro,ro' , V 

iW m ' < $ 7= — , (x + x m + x m >) < 99- 



The triangular inequality gives 

\\SfYi <5 m /|| ^ 2||^ ^roll "t - ^ro'll — ^ ' 

n 

Take u = x + x m + x m > and rj = VM>^x + x m + x m /(P*)~ 1/4 . We have E m , m ' G X„ e" 21 "^ 
e _a; , hence, a union bound gives, for L = 2^3, 



g m„, (p n - P)( Sm - Sm ,) > Wi ^^EBp^ (v + ^)) 



< e" 



1.4.3. Proof of Lemma[TS[ For every A a C A, we define 



Za(A a ):= £ E( a ^^Aj) and Z 2 (A a ):= £ ^ (Ma,*) 

l<i,j<nAeA a l<i<nAeA a 
First, since the $\xi_{\lambda,i}$ are centered, and the random vectors $(\xi_{\lambda,i})_{\lambda \in \Lambda}$ are independent,



(88) u[^i(A.)] = 5;Ew E Ki]= E^ 

i=l AGA a \i=l / \AeA a 

(89) and E[Z 2 (A )] = . 



Second, using repeatedly that the $\xi_{\lambda,i}$ are centered, and the random vectors $(\xi_{\lambda,i})_{\lambda \in \Lambda}$ are independent,
we get: for every $\Lambda_1, \Lambda_2 \subset \Lambda$,

E[Z 1 (A 1 )Z 1 (A 2 )} = E (<*i,j<Xk,e®[t;x4xM',kt;x',e]) 

l<i,j,k,e<n AeAi , A'gA 2 

= E E WMtii&A) + E E {^a^uiA) 

i=l AeAi , A'eA 2 l<j^fe<n AeAi , A'eA 2 

+ E E (^J E [&,i&J&V&^]) + E E K^^A^A^A',*]) 

l<i¥=j<n AeAi , A'GA 2 l<«^j<n AgAi , A'GA 2 



(90) 




(91) 



E[Z 1 (A 1 )Z 2 (A 2 )]= Y, E (ai^A'E[^Aj6',fc]) 

l<«j',fc<n AeAi , A'GA 2 



=1 AeAi ,A'eA 2 







'o 77(2,1)" 




\ AeA a ,A'eA 6 





(92) 

E[Z 2 (A 1 )Z 2 (A 2 )]= £ £ (/3 A/ 3vE[6^ Vj ])=n £ (ft&'Ci^) 

l<i,j<n AeAi , A'GA 2 AgAi , A'GA 2 
The result follows from the combination of (88), (89), (90), (91) and (92) since

$$\begin{aligned}
\operatorname{cov}\big( Z_1(\Lambda_1) + Z_2(\Lambda_1) ,\ Z_1(\Lambda_2) + Z_2(\Lambda_2) \big)
&= E[Z_1(\Lambda_1) Z_1(\Lambda_2)] + E[Z_1(\Lambda_1) Z_2(\Lambda_2)] + E[Z_2(\Lambda_1) Z_1(\Lambda_2)] \\
&\quad + E[Z_2(\Lambda_1) Z_2(\Lambda_2)] - E[Z_1(\Lambda_1) + Z_2(\Lambda_1)]\, E[Z_1(\Lambda_2) + Z_2(\Lambda_2)]
\end{aligned}$$



E 



E 

AeAi , A'eA 2 



+ 2 



E a h 



E 

AeAi , A'eA 2 



77(1,1) 

°A,A' 



E 

l<i<n 
2 



E 

KKn 



+ E 



E ^ E «a 



AeAi 



AeA 2 



, i=i 



E 

AeAi , A'eA 2 



fl 7^(2,1) 



OLii. 



E 

AeA 2 , A'eAi 



a 77(2,1) 

Pa'<^a,a' 



AeAi , A'eA 2 



77(L1) 
A, A' 



E ^ E ^ 



AeAi 



AeA 2 



E 

,i=l 



E bSJ- IE 



''A 



AeAi ,A'eA 2 



AeAi 



E wa 

AeA 2 



E 

l<i^j'<n 
n \ 



E 

AeAi ,A'eA 2 



77(1.1) 
U A,A' 



AeAi , A'eA 2 



AeAi , A'eA 2 



1.5. Extension of the results on V-fold penalties to general pseudo-regular partitions.

Extending Theorem 1 to partitions $\mathcal{B}$ satisfying (H5) instead of (H5*) essentially requires to extend
Propositions 3 and 4 to pseudo-regular partitions. Then, the proof of Theorem 1 straightforwardly
yields an oracle inequality under assumption (H5).

1.5.1. Exact formula for V-fold penalties: the general case. In this section, we extend Proposition 3
to partitions $\mathcal{B}$ satisfying (H5) instead of (H5*).

Lemma 23. Let $V \ge 2$, $n \ge 4$ and $\mathcal{B}$ satisfying (H5). A function $\delta : \{1, \dots, V\}^2 \to [-32, 32]$
(defined by Eq. (95)) exists such that for every $m \in \mathcal{M}_n$,

(93) 



V — 1 

2C 
(94) 



V 



pen VF (m) = \\s m - s m \\ - - {U m - U B ,m) + R P en VF (m,B), where 



,6):=^ E WO \ E W) " Pi») ] ( E (^a(^) - Pi» 



Furthermore, if (H5*) holds, $\delta = 0$.

Proving Lemma 23 requires the following lemma about the "covariance" of the weights $W_i$.




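Lemma 24. Assume that $V \ge 2$ and $n \ge 4$ and that (H5) holds. For all $i \in \{1, \dots, n\}$, let $K_i$ be
the index of the block such that $i \in B_{K_i}$. Then, a function $\delta : \{1, \dots, V\}^2 \to \mathbb{R}$ exists such that
for all $i, j \in \{1, \dots, n\}$,

(95) $E\big[ (W_i - 1)(W_j - 1) \big] = \frac{\mathbf{1}_{K_i = K_j}}{V-1} - \frac{\mathbf{1}_{K_i \ne K_j}}{(V-1)^2} + \frac{\delta(K_i, K_j)}{n}$

and $|\delta(K_i, K_j)| \le 32$.

Furthermore, if (H5*) holds, $\delta = 0$.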

Proof of Lemma 24. Let us first recall that, from (H5), for all $K \in \{1, \dots, V\}$, we have

1 — = 1 — — — — , where \t]k\ < L 

n V n 

Hence, 

1 1 v 1 _ v 

1 - n Wn ~ 1 - V" 1 - n" W ~ V - 1 1 - y ^ ~ V —1 +rK ' 
with rif := m -^^7 — . Since V > 2 and n > 4, 

V - 1 14 

l r *l ^ n(y-i)-y " n (^i)_i - 3=1 - n ■ 

Moreover, $r_K = 0$ when $n_K = n/V$. We have

When -Kj = iTj, we have 

r 

^ E ( 1 + r ^) 2 = 1+ y 1 T + 7y Z T^ S ( 2 ^ + ^) 

1<K<V 1<K<V 



When 7^ ifj, we have 



E ( 1 + ^) 2 = 1 -n7 1 n2 + a7^2 E (2r* + 4) 



(v-1) 2 ^ (v-i) 2 (v-i) 2 

Hence, 

E {{Wi - \){Wj - 1)) = E(WiWj) - E(Wi) - E(Wj) + 1 
1 Vl^ x , 1 



V-1 (V-1) 2 (V-1) 
Since \tk\ < 4n , V|r/d < 4, hence \2rx + V r K\ < 6|r/d- We also have 



V 



— 1 n 



K=l 



if Ki / K 



V-l 



l<if<V 



6 I rid 



, , 24(V-2) 8 32 

+ rjr< + rjc, < n \ t~\ +-< — 
[V — l)n n n 



Hence, if 



S(K i ,K j ) = n(V-l) 



(V 



-— ^ J2 ( 2rK + Vr K ) ~ y~[ ( rK * + r ^ ) 



1<K<V 



$|\delta(K_i, K_j)| \le 32$. Furthermore, if $n_K = n/V$ for every $K$, then $r_K = 0$ so that $\delta(K_i, K_j) = 0$. □

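Proof of Lemma 23. From Eq. (37), we have

$$\mathrm{pen}_{\mathrm{VF}}(m) = \frac{2C}{n^2} \sum_{\lambda \in \Lambda_m} \sum_{1 \le i,j \le n} E\big[ (W_i - 1)(W_j - 1) \big] \big( \psi_\lambda(X_i) - P\psi_\lambda \big) \big( \psi_\lambda(X_j) - P\psi_\lambda \big) .$$

Thanks to Lemma 24, we deduce that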

penvF(m) - ^TvE E (^(i ! )-^a)(«)-^a) 



2C 



n 2 (F- 1) 



AeA m l<i,j<n 



V 1 



v 



E E E - PtPx) (M*j) - PtPx) 

XeA m K^K'=lieB K ,jeB K , 



AeA m l<ij'<n 

1 2 ^ 

Sm|| 777 772 ( ^ m ^B,m) 



(F-l)s 



+ 



-3-L-- e E w^) ( E Wxm-PM ) ( E (Mxa-pm 

V ' AeA m JsT,.fir'=l \i£0K / \jeB A ., 



□ 



1.5.2. Concentration of V-fold penalties: the general case. In this section, we extend Proposition 4
to partitions $\mathcal{B}$ satisfying (H5) instead of (H5*).



Proposition 25. Let V > 2, n > 4 and B satisfying (|H5p . For every m £ -M nj Zei pen VF (m) 6e 
the V -fold penalty defined by Eq. ^ on a linear S m satisfying Eq. (IH1I) and (IH4j) . Zei 2 < a; < 



(96) 



AyV^R*) 1 ' 4 and fei 
+ V 1 / 3 + 



There exists an absolute constant L such that, 
(97) 1 



V -I 2 

pen VF (m,B) - 2 ||s m - s m || 



> Le^n, V, x) 



Rr, 



n 



< 2e 



2-x 



Proof of Proposition 25. Proposition 25 follows from Lemma 23 and concentration results for all the
terms appearing in Lemma 23: Lemma 11 for $U_m$, Lemma 26 for $U_{\mathcal{B},m}$, and finally, for $R_{\mathrm{pen}_{\mathrm{VF}}}(m, \mathcal{B})$,
Lemma 27 and Eq. (35) imply that



R peilyF (m,B)\ > L 



C+x 3 R ri 



n n 



< e 



2-x 



□ 



Under (H5), Lemma 9 implies the following concentration inequality for $U_{\mathcal{B},m}$.

Lemma 26. Let $\xi_{1:n}$ be i.i.d. random variables and let $S_m$ be a linear space satisfying Assumptions
(H1), (H4). For all $m \in \mathcal{M}_n$, let $(\psi_\lambda)_{\lambda \in \Lambda_m}$ be an orthonormal basis of $S_m$. Let $(B_K)_{K=1,\dots,V}$ be a
partition of $\{1, \dots, n\}$ satisfying (H5). Let



e 5 (n,V,x) 



v 



v 



Rt 



■n 



n 



An absolute constant L > exists such that, for all m G M n and 2 < x < C^ 2 (i?*) 1 ^ 2 A 



F \U B ,m\ > Le 5 (n,V,x) 



C+X 1 ' 2 R r , 



< 2e" 



(^) 1/4 n J 

Proof of Lemma 26. Under the pseudo-regularity assumption (H5), we have $n_K \le n/V + 1$, hence

V V 



K=l 



K=l 



V 



Therefore, 



1 

n 



V 



\ K=\ 



V n 



From Lemma [9] with v = ^/x(R^) 1//4 , lZ m = C*R m , we deduce that there exists an absolute 
constant L such that, for all e > satisfying e^fx < 1, 



P \U B , m \ < C*Lx 



Rr, 



n 



e + 



i Vv^ 

V ne 3 



> 1 - 2e~ x . 



Let e = U 1 / 6 !/ Ax 1 / 2 , we have 

v 2 e v VV 1 \ x(Rt ) 1 ^ 4 
V < - — < V < — - — — 

eVV W ~ V 1 / 3 ' ne 3 ~ nv 3 \ 



n 



1 Rt 



x n 



V 



x(Rt 



,1/4 



71 



□ 



The concentration of the remainder term follows from the following lemma.

Lemma 27. Let $\xi_1, \dots, \xi_n$ be i.i.d. Let $S_m$ be a linear space of functions and let $(\psi_\lambda)_{\lambda \in \Lambda_m}$ be an
orthonormal basis of $S_m$. Let $(B_K)_{K=1,\dots,V}$ be a partition of $\{1, \dots, n\}$ and for all $K = 1, \dots, V$, let
$n_K = \mathrm{Card}(B_K)$. Let $(\delta(K, K'))_{K,K'=1,\dots,V}$ be a family of real numbers, bounded by $\delta_\star$ and let



V 



E E wo E(^(^)-^ 



AeA m K^K'=l 



xeAm k=i 



Some absolute constants $C_1, C_2$ exist such that, for all $x > 0$, we have



(98) 
(99) 



2 „2 



r 



Zi| < Cm** L> m x + v^aT + — ftlx 3 ) | > 1 - e 



.2 ™3 



n 



„2-z 



Z 2 | < C 2 n5* \D m ^+ v 2 m x 3 / 2 + Y-b 2 m x 5 /A) > 1 - e 2 "* 



Proof of Lemma 27. For all $\lambda \in \Lambda_m$, for all $K = 1, \dots, V$, let

$$Z_{\lambda,K} = \sum_{i \in B_K} \big( \psi_\lambda(\xi_i) - P\psi_\lambda \big) \quad \text{and} \quad Z_K = (Z_{\lambda,K})_{\lambda \in \Lambda_m} .$$

Proof of Eq. (98). The random variables $(Z_K)_{K=1,\dots,V}$ are independent. From [BBLM05, Theorem 2]
(recalled by Lemma 32), for all $q \ge 2$, we have



IZiII < 2a/c a 



Y (Zi-E[Zi| (Z*/ 



A=l 



3/2 



As the Z,\ x are centered, we deduce that 



v 



Zi-E[>i| (ZtfOtfV*"] = E E S(K,K')Z x ,kZx,k' ■ 

XeAm K'=l, K'^K 

Hence, from the triangular inequality and Cauchy-Schwarz inequality 



\Z 1 \\<2-^/q 



V 


( 


E 




K=\ 


\ 



1/2 



1/2' 



^ |5(ir,^)l E z i 



A 



E Z A,A-' 



AeA„ 



AeA„ 



K'=l, K'^K 

Using the inequality $2ab \le \eta a^2 + \eta^{-1} b^2$ and the triangular inequality, we deduce that



3/2 



H3l II, < V^Q 
(100) 

< y/2cq 



i 



V 



E 7 ? 

K=l 







/ 


( E 3u) 






\AeA m / 


3/2 


V 



V 



1/2' 



£ 1^,^)1 ( E z Ik> 

K'=l, K'^K 



AeA„ 



3/2 



\ 



E 7 ? 

K=l 





2 






AeA m 


3 



V 



1/2 



£ |5(k,koi E z Ik< 



K'=l, K'^K 
52 



AGA„ 



23 



The random variables $|\delta(K, K')| \big( \sum_{\lambda \in \Lambda_m} Z_{\lambda,K'}^2 \big)^{1/2}$ being independent, we can use [BBLM05, Theorem 2]
(see Lemma 33) to obtain

4 



1/2 



E iwni E * 



2 



K'^K 



AeAr, 



2q 



< 16cV [ 8(K,K'f 

K'=l, K'^K 



2 

X,K> 



xeA„ 



Taking V = 4c q Z V K, =1 ,K^KHK,K'y ZxeA m Z l,K> J ExeA m Z KK 
that 



in Eq. (jlOOp . we deduce 



Zi\\ q <4eq 



(101) || 
By definition of Z\ K , we have 













E t(K,K>y 


E Z \K> 




E 




K^K'=l 


AeA m 


9 


AeA m 





E Z \K = n\{ sup(Pf-)-P)A 

AeA m VteBm 7 



From [Ler11] (recalled by Proposition 28), we have then, for all $x > 0$,



E z \K-n K D n 



AeAr, 



_ f 2 x 6 2 x 2 

> c I en K D m + n K ^^ + -^5- 



<2e~ x . 



It comes by integration (see Lemma 31 for detailed computations) that



(102) 



E Z \K - n KD r , 

AeA m 



J fm? , b 2 m q 2 



< c en K D m + 2e i n K -^- + 



Plugging this inequality in Eq. (101), we obtain that there exists an absolute constant $C$ such that
\\Zi\\ q <4cqD Til \ 



S(K,K') 2 n K n K/ 
K^K'=l 



,2 „2 



+ C\ [ eD m q + 



n K n K' + 



b 2 a 3 



v 



E S ( K > K ') 



Using [Arl07, Lemma 8.10] (recalled by Lemma 30), we obtain that there exists an absolute constant
$C_1$ such that, with probability $1 - e^{2-x}$,



\Zi\ < Ci (D m x + v 2 m x 2 ) 



V 



K^K'=1 



n K n K i + ft^x 3 

K^K'=l 

Proof of Eq. (99). By independence of the random variables $\delta(K) \sum_{\lambda \in \Lambda_m} Z_{\lambda,K}^2$, from [BBLM05,
Theorem 2] (see Lemma 33), we have



|Z 2 -E(Z 2 )|| q <2yfiy/q 







E w 




a:=i 


AeA m 
From Eq. (102), we have



2 

X,K 



< c(n K (D m + v m q) +b m q ). 



Hence, 



\Z 2 -E(Z 2 )\\ q <Cy/q 



v 



Y J KK) 2 nl{D m +v 2 mq )+b 2 m q 2 
\ K=l 



x E ^ K f 

\ K=l 



Using [Arl07, Lemma 8.10] (recalled by Lemma 30), we obtain that an absolute constant $C_2 > 0$
exists such that, with probability $1 - e^{2-x}$,



\Z 2 -E(Z 2 )\ < C 2 y^ 



V 



S(K) 2 n 2 K (D m + v 2 m x) + b: 



\ ^ 

\ K=l 



2 2 
X 



x E w 

\ K=l 



We conclude the proof with the inequality $|E(Z_2)| \le \sum_{K=1}^{V} n_K |\delta(K)| D_m \le \delta_\star D_m n$. □

1.6. Probabilistic Tools. This section recalls several probabilistic tools (most of them classical)
that are used several times in the proofs. First, we recall a concentration inequality obtained in
Theorem 4.1 in the supplementary material of [Ler11].



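Proposition 28 ([Ler11]). Let $\xi_{1:N}$ be i.i.d. random variables valued in a measurable space $(\mathbb{X}, \mathcal{X})$,
with common distribution $P$. Let $S$ be a symmetric class of functions bounded by $b$. For all $t \in S$,
let $P_N t = N^{-1} \sum_{i=1}^{N} t(\xi_i)$, $v^2 = \sup_{t \in S} P[(t - Pt)^2]$, $Z = \sup_{t \in S} (P_N - P)t$, $D = N E(Z^2)$. For all
$x > 0$ and all $\epsilon \in (0, 1]$, with probability larger than $1 - 2e^{-x}$,

$$\Big| Z^2 - \frac{D}{N} \Big| \le L \left( \epsilon \frac{D}{N} + \frac{1}{\epsilon} \left( \frac{v^2 x}{N} + \Big( \frac{bx}{\epsilon N} \Big)^2 \right) \right) .$$

The constant $L = 16 (\ln 2)^2 + 8$ works. In particular, if $R > 0$ and $\eta > 0$ satisfy

$$D \le R, \quad v^2 \le \eta^2 R, \quad \frac{b^2}{N} \le \eta^4 R,$$

taking $\epsilon = \eta \sqrt{x}$ in the previous inequality yields, for all $x$ such that $\eta \sqrt{x} \le 1$,

$$P \left( \Big| Z^2 - \frac{D}{N} \Big| \le 3 L \eta \sqrt{x}\, \frac{R}{N} \right) \ge 1 - 2e^{-x} .$$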
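The choice $\epsilon = \eta\sqrt{x}$ can be checked term by term. The short derivation below is a sketch that assumes the deviation bound has the three-term form $L\big(\epsilon D/N + v^2 x/(\epsilon N) + b^2 x^2/(\epsilon^3 N^2)\big)$ and that the conditions read $D \le R$, $v^2 \le \eta^2 R$, $b^2/N \le \eta^4 R$; both are read off a damaged source and should be treated as assumptions.

```latex
% Assumed bound: L ( eps D/N + v^2 x/(eps N) + b^2 x^2/(eps^3 N^2) ), with eps = eta sqrt(x).
\begin{align*}
\epsilon \frac{D}{N}
  &\le \eta\sqrt{x}\, \frac{R}{N}, \\
\frac{v^2 x}{\epsilon N}
  &\le \frac{\eta^2 R\, x}{\eta \sqrt{x}\, N}
   = \eta\sqrt{x}\, \frac{R}{N}, \\
\frac{b^2 x^2}{\epsilon^3 N^2}
  &\le \frac{\eta^4 R N\, x^2}{\eta^3 x^{3/2} N^2}
   = \eta\sqrt{x}\, \frac{R}{N}.
\end{align*}
```

Each of the three terms is thus at most $\eta\sqrt{x}\, R/N$, which accounts for the factor $3L$; the restriction $\eta\sqrt{x} \le 1$ only ensures that $\epsilon \in (0, 1]$.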



In particular, Proposition 28 implies the following.

Corollary 29 (Corollary 4.3 in the supplementary material of [Ler11]). Let $\xi_{1:N}$ be i.i.d. random
variables valued in a measurable space $(\mathbb{X}, \mathcal{X})$ with common law $P$. Let $\mu$ be a measure on $(\mathbb{X}, \mathcal{X})$
and let $(t_\lambda)_{\lambda \in \Lambda_m}$ be a set of functions in $L^2(\mu)$. Let



D 



j * = E a ^ E a l ^ 1 \ ' D = E ( su p(*(^i) - pt ) 2 ) 

{ \eA m AeA m J ^ teB ' 



v 2 = sup P[(t - Pt) 2 ], 6 = sup 



teB 



teB 



Let U be the following U -statistics 



U 



N(N 



1 N 

r—^ 52(t x (&-pt x )(t x (t j )-Pt\). 
For all $x > 0$ and all $\epsilon \in (0, 1]$, with probability larger than $1 - 4e^{-x}$,



\U\ <L'\e- + 



D 11 v 2 x ( bx 



N e \ N 



+ 



eN 



The constant U = 2(L + 4(ln2) x ) works, where L is defined in Proposition \28l In particular, if 
R > and rj > satisfy 

b 2 

D <R, v 2 < i] 2 R, — < r] A R, 
P 

taking e = r)s/x in the previous inequality yields, for all x such that rjy/x < I, 

P\\U\< 3L' V V^^ > 1 - 4e~ x . 

We now recall some lemmas proved in [Arl07], about the links between moment and concentration 
inequalities. 

Lemma 30 (Lemma 8.10 in [Arl07]). Let X\, An > 0, fi±, [an > and £ be a random variable 
such that for all q > qo > 0, 

N 



u\\ q <J2 x ^ ■ 



i=l 



Then, for every y > 0, 



N 



iei>E 



i=i 



A; 



ey 



mm, 



■j H 



< e 9omin 3 {Atj} -y 



We give a slight generalization of Lemma 8.12 in [Arl07].

Lemma 31. Let a%, ajv > 0, aj, a^v > 0, 6 > and ^ be a random variable such that 

P[|£| > sup a i2 / a * ] <6exp(-y) . 
V i=i,...jv / 



Then, /or every q > max(a j ) V 1, 



N 



i=l 



Proof of Lemma \3J 



P(|£| > y l ' q )dy < b 



uni=i,...,N(y 1/q /a i ) 1 / a i 



dy 



N 



< 



i=i 



-( 2 / 1 /9/a i ) 1/a< 



dy = bY] a\qa.i \ t^^e^dt. 
i=i ^ 



Now, from the inequality 



o 



P 



V.? > .1 . /"* t^e-'dt <e(^) VP , 



since, for all i, gai > 1, we deduce 



A? 



t=l 
Figure 4. Oracle model for some sample generated according to L. Left: Regu.
Right: Dya2.



Hence, since q > 1, and q l / q < e, 



lien, < (^'-Y.a^fM (ssy < be ^ a , (S)~«% 



i=l 



□ 



Finally, we recall two moment inequalities that are corollaries of [BBLM05, Theorem 2] (see also
Lemmas 8.17 and 8.18 in [Arl07], respectively).

Lemma 32 (Corollary of Theorem 2 of [BBLM05]). Let $\xi_{1:N}$ be $N$ independent random variables,
let $f$ be a measurable function $\mathbb{X}^N \to \mathbb{R}$ and

$$Z = f(\xi_{1:N}) .$$

There exists c < 1.271 such that, for every q > 2, 



\Z-E{Z)\\ q < l-Jly/q 



N 



£(Z-E[Z|fo);*]: 



i=i 



?/2 



Lemma 33 (Corollary of Theorem 2 of [BBLMO^]). Let £i : jv &e iV independent random variables 



admitting q-th moments for some q > 2. Let S = Then, there exists c < 1.271 such that 

\\S\\ q < 2yfa/q> 



N 



\ E 116 

\ i=l 



1.7. Additional simulation results. This section provides simulation results in addition to the
ones of Section 5. First, Table 3 is an extended version of Table 2 with more procedures compared
and two additional settings (L-Regu and S-Regu). Second, Figure 4 is the analogue in setting L of
the corresponding figure of the main paper; it illustrates the difference between the model collections
Regu and Dya2. Third, Figure 5 is the analogue in setting L of the corresponding figure of the main
paper; it illustrates how the variance of V-fold criteria depends on V.
Table 3. Simulation results: settings L and S. The best procedures (up to standard
deviations) are bolded, where the data-driven procedures are considered separately
from the procedures using the knowledge of E[pen_id].

Experiment                 L-Dya2          L-Regu          S-Dya2          S-Regu

E[pen_id]                  6.62 ± 0.18     2.35 ± 0.05     2.09 ± 0.03     1.76 ± 0.02
1.25 x E[pen_id]           4.78 ± 0.12     2.04 ± 0.03     1.95 ± 0.02     1.63 ± 0.01
1.5 x E[pen_id]            4.13 ± 0.09     1.92 ± 0.02     1.93 ± 0.02     1.66 ± 0.01
2 x E[pen_id]              3.66 ± 0.06     1.97 ± 0.02     2.02 ± 0.02     1.83 ± 0.01

C_p                        8.52 ± 0.24     2.35 ± 0.05     3.26 ± 0.04     1.76 ± 0.02
1.25 x C_p                 6.10 ± 0.17     2.03 ± 0.03     3.04 ± 0.04     1.64 ± 0.01
1.5 x C_p                  4.97 ± 0.12     1.92 ± 0.02     3.01 ± 0.04     1.66 ± 0.01
2 x C_p                    4.38 ± 0.09     1.97 ± 0.02     3.18 ± 0.03     1.83 ± 0.01

pen_LOO                    6.41 ± 0.18     2.35 ± 0.05     2.08 ± 0.03     1.76 ± 0.02
1.25 x pen_LOO             4.65 ± 0.12     2.03 ± 0.03     1.93 ± 0.02     1.64 ± 0.01
1.5 x pen_LOO              4.01 ± 0.09     1.92 ± 0.02     1.91 ± 0.02     1.66 ± 0.01
2 x pen_LOO                3.61 ± 0.06     1.97 ± 0.02     1.99 ± 0.02     1.83 ± 0.01

pen_VF (V=10)              6.76 ± 0.17     2.44 ± 0.05     2.14 ± 0.03     1.78 ± 0.02
1.25 x pen_VF (V=10)       4.96 ± 0.12     2.05 ± 0.04     1.96 ± 0.02     1.62 ± 0.01
1.5 x pen_VF (V=10)        4.28 ± 0.10     1.93 ± 0.02     1.91 ± 0.02     1.64 ± 0.01
2 x pen_VF (V=10)          3.66 ± 0.06     1.92 ± 0.02     1.95 ± 0.02     1.77 ± 0.01

pen_VF (V=5)               7.53 ± 0.19     2.60 ± 0.06     2.21 ± 0.03     1.80 ± 0.02
1.25 x pen_VF (V=5)        5.50 ± 0.13     2.15 ± 0.04     2.00 ± 0.02     1.63 ± 0.01
1.5 x pen_VF (V=5)         4.65 ± 0.11     1.96 ± 0.03     1.95 ± 0.02     1.61 ± 0.01
2 x pen_VF (V=5)           3.80 ± 0.07     1.94 ± 0.02     1.98 ± 0.02     1.72 ± 0.01

pen_VF (V=2)               10.27 ± 0.24    3.22 ± 0.09     2.46 ± 0.03     2.02 ± 0.03
1.25 x pen_VF (V=2)        7.77 ± 0.19     2.41 ± 0.05     2.23 ± 0.03     1.73 ± 0.02
1.5 x pen_VF (V=2)         6.41 ± 0.16     2.18 ± 0.04     2.10 ± 0.02     1.63 ± 0.01
2 x pen_VF (V=2)           5.12 ± 0.12     1.94 ± 0.03     2.06 ± 0.02     1.64 ± 0.01

LOO                        6.41 ± 0.18     2.35 ± 0.05     2.08 ± 0.03     1.76 ± 0.02
10-fold CV                 6.25 ± 0.16     2.34 ± 0.05     2.07 ± 0.03     1.71 ± 0.02
5-fold CV                  6.27 ± 0.16     2.28 ± 0.05     2.09 ± 0.03     1.68 ± 0.02
2-fold CV                  6.41 ± 0.16     2.18 ± 0.04     2.10 ± 0.02     1.63 ± 0.01

Oracle: 10^-3 x            5.49 ± 0.06     13.28 ± 0.16    43.91 ± 0.29    62.66 ± 0.39
Best: 10^-3 x              19.82 ± 0.33    25.45 ± 0.28    83.70 ± 0.67    101.00 ± 0.76



CNRS ; Sierra Project-Team, Laboratoire d'Informatique de l'Ecole Normale Superieure, (CNRS/ENS/INRIA 
UMR 8548), INRIA - 23 avenue d'Italie - CS 81321, 75214 PARIS Cedex 13 - France 
E-mail address: sylvain.arlot@ens.fr 

Laboratoire J. A. Dieudonne, CNRS UMR 6621, Universite de Nice - Sophia Antipolis, 06108 Nice 
Cedex France 

E-mail address: mlerasle@unice.fr 
Figure 5. Visualization of variances as a function of V in experiment L-Regu.
Right: zoom of the left part and "selectable models".